prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2024-11-14 17:44:06 -08:00

Author	SHA1	Message	Date
Brian Brazil	2184b79763	Mark deleted rule's series as stale on next evaluation. (#5759 ) Fixes #5755 Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>	2019-08-07 16:11:05 +01:00
beorn7	dd81912554	Add objectives to Summaries With the next release of client_golang, Summaries will not have objectives by default. To not lose the objectives we have right now, explicitly state the current default objectives. Signed-off-by: beorn7 <beorn@grafana.com>	2019-06-12 02:03:13 +02:00
pbhudiaBAE	43953b105b	Sorting alerts by group name in /alerts (#5448 ) * Working group name Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com> * Working categorised by group name Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com> * Changed group sorting in web Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com> * Fixed group sorting and comments Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com> * Fixed group sorting and comments with gofmt Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com> * Added file and group name Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com> * reverted back to full path to yml file Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>	2019-05-14 23:14:27 +02:00
Yao Zengzeng	5544cb252a	fix some mistakes in comments (#5533 ) Signed-off-by: YaoZengzeng <yaozengzeng@zju.edu.cn>	2019-05-05 10:48:42 +01:00
Simon Pasquier	45506841e6	*: enable all default linters (#5504 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-05-03 15:11:28 +02:00
Bjoern Rabenstein	38d518c0fe	Rework #5009 after comments Signed-off-by: Bjoern Rabenstein <bjoern@rabenste.in>	2019-04-17 01:40:10 +02:00
Tariq Ibrahim	8fdfa8abea	refine error handling in prometheus (#5388 ) i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors. ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives. iii) Does away with the use of fmt package for errors in favour of pkg/errors Signed-off-by: tariqibrahim <tariq181290@gmail.com>	2019-03-26 00:01:12 +01:00
James Ravn	e15d8c5802	reload: copy state on both name and labels (#5368 ) * reload: copy state on both name and labels Fix https://github.com/prometheus/prometheus/issues/5193 Using just name causes the linked issue - if new rules are inserted with the same name (but different labels), the reordering will cause stale markers to be inserted in the next eval for all shifted rules, despite them not being stale. Ideally we want to avoid stale markers for time series that still exist in the new rules, with name and labels being the unique identifier. This change adds labels to the internal map when copying the old rule data to the new rule data. This prevents the problem of staling rules that simply shifted order. If labels change, it is a new time series and the old series will stale regardless. So it should be safe to always match on name and labels when copying state. Signed-off-by: James Ravn <james@r-vn.org>	2019-03-15 15:23:36 +00:00
David Symonds	46361a7c85	rules: Fix sorting of result from (*Manager).RuleGroups (#5260 ) The previous code was defective in that it never sorted groups within a file due to doing a multi-key sort incorrectly. Signed-off-by: David Symonds <dsymonds@gmail.com>	2019-02-23 09:51:44 +01:00
beorn7	2db1eeb4ec	Fix prometheus_rule_group_last_evaluation_timestamp_seconds It should be a unix timestamp, not the seconds in the minute. Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-06 11:02:49 +01:00
Ganesh Vernekar	787eb1e904	Set rule_group_last_duration_seconds to seconds (#5153 ) Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2019-01-31 11:07:58 +01:00
Vishnunarayan K I	fd3ef6ba34	Add metric rule_group_rules_loaded to get the number of rules loaded (#5090 ) Signed-off-by: Vishnunarayan K I <appukuttancr@gmail.com>	2019-01-13 14:28:07 +00:00
Tom Wilkie	121603c417	Expose rules.NewGroupMetrics and rules.Metrics. (#5059 ) Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-01-03 12:07:06 +00:00
Bartek Płotka	de213d4a5e	rule manager: Moved metric registration to custom registerer which is already available. (#4961 ) Signed-off-by: Bartek Plotka <bwplotka@gmail.com>	2018-12-28 10:20:29 +00:00
mknapphrt	f0e9196dca	Return warnings on a remote read fail (#4832 ) Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>	2018-11-30 14:27:12 +00:00
Krasi Georgiev	0754e5334b	querier for RestoreForState not closed. (#4922 ) Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-11-28 15:25:17 +02:00
Wei Guo	e329cbf673	Add metric prometheus_rule_group_last_evaluation for recording and alerting (#4852 ) * add metric prometheus_rule_group_last_evaluation for recording and alerting Signed-off-by: Wei Guo <me@imkira.com> * fix issues from comments Signed-off-by: Wei Guo <me@imkira.com>	2018-11-27 14:38:13 +08:00
Will Hegedus	193ebe7e34	Updates to /targets and /rules (scrape duration, last evaluation time) (#4722 ) * Add evaluationTimestamp (Last Evaluation) column to display on /rules Signed-off-by: Will Hegedus <wbhegedus@liberty.edu> * Add lastScrapeDuration ("Scrape Duration") to display on /targets Signed-off-by: Will Hegedus <wbhegedus@liberty.edu> * Updates based on Julius' feedback Signed-off-by: Will Hegedus <wbhegedus@liberty.edu> * Update to set timestamp to when eval started (after eval completes) Signed-off-by: Will Hegedus <wbhegedus@liberty.edu> * Update /rules to display time since last evaluation Signed-off-by: Will Hegedus <wbhegedus@liberty.edu> * Re-order Last Eval/Eval Time to be consistent with targets page Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>	2018-10-12 18:26:59 +02:00
Ganesh Vernekar	5790d23fd8	Unit testing for rules (#4350 ) * Unit testing for rules * Specifying order of group evaluation in unit tests Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2018-09-25 17:06:26 +01:00
Chris Marchbanks	63ed9d1b70	Send EndsAt along with alerts (#4550 ) Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2018-08-28 16:05:00 +01:00
Chris Marchbanks	87f1dad16d	throttle resends of alerts to 1 minute by default (#4538 ) Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2018-08-27 17:41:42 +01:00
Julius Volz	8fbe1b5133	Handle a bunch of unchecked errors (#4461 ) There are many more (mostly finalizers like Close/Stop/etc.), but most of the others seemed like one couldn't do much about them anyway. Signed-off-by: Julius Volz <julius.volz@gmail.com>	2018-08-17 17:24:35 +02:00
Julien Pivotto	0b4d22b245	rules/manager: remove a no-longer-relevant comment (#4503 ) Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2018-08-15 09:33:39 +01:00
Benji Visser	8bb6e0dd6e	Show rule evaluation errors on rules page (#4457 ) * adding information about the health and errors for Rules adding Health() and LastError() to the Rule interface. This will allow us to easily surface information about rules. Signed-off-by: noqcks <benny@noqcks.io> * updating rules.html with fields for Rule errors and health state Signed-off-by: noqcks <benny@noqcks.io> * fix code comment grammar & access Rule health/error info using a mutex Signed-off-by: noqcks <benny@noqcks.io> * s/Errors/Error/ in rules.html to remain consistent with targets.html Signed-off-by: noqcks <benny@noqcks.io> * adding periods to code comments in reporting/alerting Signed-off-by: noqcks <benny@noqcks.io> * putting health/error below mutex in struct field Signed-off-by: noqcks <benny@noqcks.io>	2018-08-07 00:33:45 +02:00
Julius Volz	90521a65f8	Remove error return value from NotifyFunc() (#4459 ) It's always nil and we also forgot to check it. Signed-off-by: Julius Volz <julius.volz@gmail.com>	2018-08-04 21:31:12 +02:00
Ganesh Vernekar	f1db699dff	Persist alert 'for' state across restarts (#4061 ) Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2018-08-02 11:18:24 +01:00
Max Leonard Inden	71fafad099	api/v1: Continue work exposing rules and alerts Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-30 15:31:51 +02:00
Bryan Boreham	afdb66dfac	Expose Group.CopyState() (#4304 ) This makes the `rules` package more useful to projects that use Prometheus as a library. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2018-07-18 15:14:38 +02:00
Julius Volz	9e3171f6e3	rules: Minor naming/comment cleanups (#4328 ) Signed-off-by: Julius Volz <julius.volz@gmail.com>	2018-07-18 04:54:33 +01:00
Alin Sinpalean	9dc763cc03	Run rule evaluation with timestamps precisely evaluation_interval apart (#4201 ) * Run rule evaluation with timestamps precisely evaluation_interval apart from one another. Signed-off-by: Alin Sinpalean <alin.sinpalean@gmail.com>	2018-06-01 15:23:07 +01:00
Mario Trangoni	464e747f1e	fix some comments typos (#4059 )	2018-04-08 10:51:54 +01:00
Bryan Boreham	93494d8b7e	Add an OpenTracing span for each rule (#4027 ) * Add an OpenTracing span for each rule So that tags and child spans can be traced back to the rule that they refer to.	2018-03-30 21:29:19 +01:00
ferhat elmas	ffa673f7d8	General simplifications (#3887 ) Another try as in #1516	2018-02-26 07:58:10 +00:00
Fabian Reinartz	7ccd4b39b8	*: implement query params This adds a parameter to the storage selection interface which allows query engine(s) to pass information about the operations surrounding a data selection. This can for example be used by remote storage backends to infer the correct downsampling aggregates that need to be provided.	2018-02-13 12:17:22 +01:00
Brian Brazil	30b4439bbd	Remove rule_type label from rule metrics. This is not really needed now that we have rule groups to distinguish rules.	2017-12-04 11:44:38 +00:00
Brian Brazil	b97f4cf48c	Add metrics for rule group interval and last duration.	2017-12-04 11:44:38 +00:00
Brian Brazil	0a42a9fc8f	Copy over rule group duration on reload. This is currently getting lost, this will soon be in a metric and we don't want it dropping to 0 on every reload.	2017-12-04 11:44:38 +00:00
Brian Brazil	aa370fa568	Clarify metric names around rule groups. Make it clear they're about overall rule groups.	2017-12-04 11:44:38 +00:00
Fabian Reinartz	62461379b7	rules: decouple notifier packages The dependency on the notifier packages caused a transitive dependency on discovery and with that all client libraries our service discovery uses.	2017-11-27 16:38:14 +01:00
Fabian Reinartz	4d964a0a0d	rules: make glob expansion a concern of main	2017-11-24 08:22:57 +01:00
Fabian Reinartz	bd9f7460eb	rules: remove config package dependency	2017-11-24 07:57:54 +01:00
Fabian Reinartz	2d0e3746ac	rules: remove dependency on promql.Engine	2017-11-24 07:57:54 +01:00
Goutham Veeramachaneni	a880c86375	Fix unexported method on exported interface. Also move to model.Duration Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-11-23 19:13:57 +05:30
conorbroderick	55aaece116	Add rule evaluation time	2017-11-22 15:22:02 +00:00
Goutham Veeramachaneni	e1117715fe	rules: remove skipped iterations cuz no throttling Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-11-14 17:33:00 +05:30
Krasi Georgiev	e86d82ad2d	Fix regression of alert rules state loss on config reload. (#3382 ) * incorrect map name for the group prevented copying state from existing alert rules on config reload * applyConfig test * few nits * nits 2	2017-11-01 12:58:00 +01:00
Julius Volz	099df0c5f0	Migrate "golang.org/x/net/context" -> "context" (#3333 ) In some places, where ctxhttp or gRPC are concerned, we still need to use the old contexts.	2017-10-24 21:21:42 -07:00
Brian Brazil	ee88f0d222	Ensure all values are used or _	2017-10-09 19:44:03 +01:00
Fabian Reinartz	2d0b8e8b94	Merge branch 'master' into dev-2.0	2017-10-05 13:09:18 +02:00
beorn7	c2e9a151ab	Make all rule links link to the "Console" tab rather than "Graph" Clicking on a rule, either the name or the expression, opens the rule result (or the corresponding expression, respectively) in the expression browser. This should by default happen in the console tab, as, more often than not, displaying it in the graph tab runs into a timeout.	2017-09-21 18:28:00 +02:00
Fabian Reinartz	d21f149745	*: migrate to go-kit/log	2017-09-08 22:01:51 +05:30
Goutham Veeramachaneni	37e7b69f56	Merge remote-tracking branch 'upstream/dev-2.0' into rulegroups Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-19 16:34:55 +05:30
Goutham Veeramachaneni	c472316fb3	Check done before every rule evaluation. Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-16 16:57:22 +05:30
Goutham Veeramachaneni	6b70a4d850	Incorporate PR feedback * Move fingerprint to Hash() * Move away from tsdb.MultiError * 0777 -> 0666 for files * checkOverflow of extra fields Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-16 16:44:33 +05:30
Goutham Veeramachaneni	507790a357	Rework logging to use explicitly passed logger Mostly cleaned up the global logger use. Still some uses in discovery package. Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-16 15:52:44 +05:30
Goutham Veeramachaneni	dc69645e92	Move back to go-yaml Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-16 10:46:21 +05:30
Goutham Veeramachaneni	5ff283a7b7	Reflect the grouping in the UI Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-14 16:09:14 +05:30
Goutham Veeramachaneni	8cca666cf2	Add file name to group. Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-14 15:18:39 +05:30
Goutham Veeramachaneni	e893c89333	Validate labels and annotations Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-14 15:07:54 +05:30
Goutham Veeramachaneni	a48a018368	Make sure groups are unique in a single file Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-14 12:19:21 +05:30
Goutham Veeramachaneni	cea1e99f78	Add update-rules command to promtool Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2017-06-14 11:38:54 +05:30
Goutham Veeramachaneni	e8f55669ea	Move rules to new format Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>	2017-06-12 18:14:39 +05:30
Brian Brazil	cc867dae60	Copy previous series and alert state more intelligently. Usually rules don't move around, and if they do it's likely that rules/alerts with the same name stay in the same order. If rules/alerts with the same name are added/removed this could cause a blip for one cycle, but this is unavoidable without requiring rule and alert names to be unique - which we don't want to do.	2017-05-24 13:52:45 +01:00
Brian Brazil	9bc68db7e6	Track staleness per rule rather than per group.	2017-05-24 13:52:45 +01:00
Brian Brazil	0451d6d31b	Add unittest for rule staleness, and rules generally.	2017-05-24 13:52:45 +01:00
Brian Brazil	0400f3cfd2	Very basic staleness handling for rules.	2017-05-24 13:52:45 +01:00
Fabian Reinartz	06c2b76cd4	Merge branch 'master' into uptsdb	2017-05-16 16:48:37 +02:00
Julius Volz	ac203ef0ee	Add externalURL template function (#2716 ) This allows users to e.g. add links back to the generating Prometheus right in their alert templates.	2017-05-13 15:47:04 +02:00
Fabian Reinartz	8ffc851147	Merge branch 'master' into dev-2.0	2017-04-04 15:17:56 +02:00
Tobias Schmidt	eaf33759fb	Register forgotten prometheus_evaluator_iterations_total metric	2017-04-02 20:32:56 -03:00
Tobias Schmidt	aaaba57184	Export number of missed rule evaluations In case the execution of all rules takes longer than the configured rule evaluation interval, one or more iterations will be skipped. This needs to be visible to the operator.	2017-04-02 20:03:28 -03:00
Fabian Reinartz	5772f1a7ba	retrieval/storage: adapt to new interface This simplifies the interface to two add methods for appends with labels or faster reference numbers.	2017-02-02 13:05:46 +01:00
Fabian Reinartz	ad9bc62e4c	storage: extend appender and adapt it	2017-01-13 14:48:01 +01:00
Fabian Reinartz	e94b0899ee	rules: fix tests, remove model types	2016-12-29 17:31:14 +01:00
Fabian Reinartz	f8fc1f5bb2	*: migrate ingestion to new batch Appender	2016-12-29 11:03:56 +01:00
Fabian Reinartz	5817cb5bde	*: migrate from model.* to promql.* types	2016-12-25 00:37:46 +01:00
Jonathan Lange	d78dd3593d	Set evaluation interval on Group construction Prevents having object in invalid state, and allows users of public API to construct valid Groups.	2016-11-18 16:32:30 +00:00
Jonathan Lange	31fc357cd8	Make NewGroup and Group.Eval public Allows callers to evaluate lists of rules without first writing them to disk.	2016-11-18 16:25:58 +00:00
Jonathan Lange	2a2da40223	Make rule evaluation publicly available Means that a third-party can parse rules and run them with their own execution model.	2016-11-18 16:12:50 +00:00
Matt Bostock	926a5ab3dd	rules/manager.go: Fix race between reload and stop On one relatively large Prometheus instance (1.7M series), I noticed that upgrades were frequently resulting in Prometheus undergoing crash recovery on start-up. On closer examination, I found that Prometheus was panicking on shutdown. It seems that our configuration management (or misconfiguration thereof) is reloading Prometheus then immediately restarting it, which I suspect is causing this race: Sep 21 15:12:42 host systemd[1]: Reloading prometheus monitoring system. Sep 21 15:12:42 host prometheus[18734]: time="2016-09-21T15:12:42Z" level=info msg="Loading configuration file /etc/prometheus/config.yaml" source="main.go:221" Sep 21 15:12:42 host systemd[1]: Reloaded prometheus monitoring system. Sep 21 15:12:44 host systemd[1]: Stopping prometheus monitoring system... Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:203" Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="See you next time!" source="main.go:210" Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="Stopping target manager..." source="targetmanager.go:90" Sep 21 15:12:52 host prometheus[18734]: time="2016-09-21T15:12:52Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:548" Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=1 source="scrape.go:467" Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84" Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping rule manager..." source="manager.go:366" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Rule manager stopped." source="manager.go:372" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping notification handler..." source="notifier.go:325" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping local storage..." source="storage.go:381" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping maintenance loop..." source="storage.go:383" Sep 21 15:13:01 host prometheus[18734]: panic: close of closed channel Sep 21 15:13:01 host prometheus[18734]: goroutine 7686074 [running]: Sep 21 15:13:01 host prometheus[18734]: panic(0xba57a0, 0xc60c42b500) Sep 21 15:13:01 host prometheus[18734]: /usr/local/go/src/runtime/panic.go:500 +0x1a1 Sep 21 15:13:01 host prometheus[18734]: github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1(0xc6645a9901, 0xc420271ef0, 0xc420338ed0, 0xc60c42b4f0, 0xc6645a9900) Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:412 +0x3c Sep 21 15:13:01 host prometheus[18734]: created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:423 +0x56b Sep 21 15:13:03 host systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT	2016-09-21 22:03:02 +01:00
Julius Volz	c187308366	storage: Contextify storage interfaces. This is based on https://github.com/prometheus/prometheus/pull/1997. This adds contexts to the relevant Storage methods and already passes PromQL's new per-query context into the storage's query methods. The immediate motivation is supporting multi-tenancy in Frankenstein, but this could also be used by Prometheus's normal local storage to support cancellations and timeouts at some point.	2016-09-19 16:29:07 +02:00
Julius Volz	ed5a0f0abe	promql: Allow per-query contexts. For Weaveworks' Frankenstein, we need to support multitenancy. In Frankenstein, we initially solved this without modifying the promql package at all: we constructed a new promql.Engine for every query and injected a storage implementation into that engine which would be primed to only collect data for a given user. This is problematic to upstream, however. Prometheus assumes that there is only one engine: the query concurrency gate is part of the engine, and the engine contains one central cancellable context to shut down all queries. Also, creating a new engine for every query seems like overkill. Thus, we want to be able to pass per-query contexts into a single engine. This change gets rid of the promql.Engine's built-in base context and allows passing in a per-query context instead. Central cancellation of all queries is still possible by deriving all passed-in contexts from one central one, but this is now the responsibility of the caller. The central query context is now created in main() and passed into the relevant components (web handler / API, rule manager). In a next step, the per-query context would have to be passed to the storage implementation, so that the storage can implement multi-tenancy or other features based on the contextual information.	2016-09-19 15:38:17 +02:00
Dmitry Vorobev	273e457da4	web: return status code and error message for config resource	2016-07-15 10:15:24 +02:00
Brian Brazil	0509b0f2db	Expand alert templates at eval time. Fixes #1678 #1677	2016-07-12 17:13:55 +01:00
beorn7	064b57858e	Consistently use the `Seconds()` method for conversion of durations This also fixes one remaining case of recording integral numbers of seconds only for a metric, i.e. this will probably fix #1796.	2016-07-07 15:24:35 +02:00
beorn7	45e5775f9b	Add missing logging of out-of-order samples So far, out-of-order samples during rule evaluation were not logged, and neither were scrape health samples. The latter are unlikely to cause any errors. That's why I'm logging them always now. (It's always highly irregular should it happen.) For rules, I have used the same plumbing as for samples, just with a different wording in the message to mark them as a result of rule evaluation.	2016-05-19 16:22:53 +02:00
Fabian Reinartz	d89c254849	Make copying alerting state safer. This considers static labels in the equality of alerts to avoid falsely copying state from a different alert definition with the same name across reloads. To be safe, it also copies the state map rather than just its pointer so that remaining collisions disappear after one evaluation interval.	2016-03-02 12:21:54 +01:00
Fabian Reinartz	bfa8aaa017	Rename notification to notifier	2016-03-01 12:39:08 +01:00
beorn7	663a1550d0	Fix the instrumentation fixes	2016-02-17 15:50:55 +01:00
beorn7	ec08c9a391	Rework the way to communicate backpressure (AKA suspended ingestion) This gives up on the idea to communicate through the Append() call (by either not returning as it is now or returning an error as suggested/explored elsewhere). Here I have added a Throttled() call, which has the advantage that it can be called before a whole _batch_ of Append()'s. Scrapes will happen completely or not at all. Same for rule group evaluations. That's a highly desired behavior (as discussed elsewhere). The code is even simpler now as the whole ingestion buffer could be removed. Logging of throttled mode has been streamlined and will create at most one message per minute.	2016-02-01 14:45:44 +01:00
Fabian Reinartz	b0adfea8d5	Fix swapped constants, improve instrumentation	2016-01-21 12:15:29 +01:00
Fabian Reinartz	a8c38c3ac5	Don't log rule evaluation failure on shutdown	2016-01-18 17:34:25 +01:00
Fabian Reinartz	6eee86dce8	Terminate rule groups during initial sleep When an evaluation group runs initially, it waits a deterministic amount of time. During that time it also has to accept a termination signal so shutdown doesn't hang during the first evaluation iteration after a configuration reload. Fixes #1307	2016-01-12 10:54:09 +01:00
Fabian Reinartz	37d80c4b25	Fix premature rule evaluation This commit prevents rule evaluation from starting until after the storage is ready.	2016-01-08 17:51:22 +01:00
Fabian Reinartz	0cf3c6a9ef	Add comments, rename a method	2015-12-23 12:29:28 +01:00
Fabian Reinartz	bf6abac8f4	Send resolved notifications	2015-12-17 15:42:26 +01:00
Fabian Reinartz	f69e668fc4	Improve rules/ instrumentation This commit adds a counter for the total number of rule evaluations and standardizes the units to seconds.	2015-12-17 15:42:26 +01:00
Fabian Reinartz	52e5224f5a	Refactor rules/ package	2015-12-17 15:42:25 +01:00
Fabian Reinartz	e4fabe135a	Set StartsAt to time of first firing state	2015-12-17 11:36:58 +01:00
Fabian Reinartz	7c90db22ed	Use annotation based alerts in rules/ This commit breaks the previously used alert format.	2015-12-14 10:16:07 +01:00
Fabian Reinartz	e114ce0ff7	Refactor notification handler	2015-12-11 15:17:32 +01:00
Fabian Reinartz	e3b6ec9784	Switch to common/log	2015-10-03 10:21:43 +02:00
Julius Volz	995d3b831d	Fix most golint warnings. This is with `golint -min_confidence=0.5`. I left several lint warnings untouched because they were either incorrect or I felt it was better not to change them at the moment.	2015-08-26 12:44:46 +02:00
Fabian Reinartz	d6b8da8d43	Switch promql types to common/model	2015-08-25 13:49:14 +02:00
Brian Brazil	fdf0d0642e	Cast value to float, as that's what the console templates expect.	2015-08-24 16:59:08 +01:00
Fabian Reinartz	438e232c9b	Fix grouping of import blocks	2015-08-22 09:42:45 +02:00
Fabian Reinartz	306e8468a0	Switch from client_golang/model to common/model	2015-08-21 13:33:38 +02:00
Fabian Reinartz	7a67472fc1	Resolve relative paths on configuration loading This moves the concern of resolving the files relative to the config file into the configuration loading itself. It also fixes #921 which did not load the cert and token files relatively.	2015-08-05 18:08:04 +02:00
Fabian Reinartz	feb8a03503	rules: load rule files relative to a base dir	2015-07-03 15:10:37 +02:00
Julius Volz	fcff35b43e	Consolidate external reachability flags into one. Besides fixing https://github.com/prometheus/prometheus/issues/805 by making the entire externally reachable server URL configurable, this adds tests for the "globalURL" template function and makes it easier to test other such functions in the future. This breaks the `web.Hostname` flag (and introduces `web.external-url`). This flag is likely only used by few users, so I hope that's justifiable. Fixes https://github.com/prometheus/prometheus/issues/805	2015-07-03 13:39:10 +02:00
Fabian Reinartz	f06cf664e1	rules: cleanup alerting test	2015-06-30 18:22:24 +02:00
Fabian Reinartz	9bd4f6d017	rules: preserve alert state across reloads.	2015-06-30 11:32:07 +02:00
Fabian Reinartz	4625485b84	rules: move rules.go contents to manager.go	2015-06-30 11:32:07 +02:00
Fabian Reinartz	749ae450c5	promql: add runbook to alert statement. This commit adds the RUNBOOK keyword to alert statements. The field is optional and expected to be a link.	2015-06-25 13:00:52 +02:00
Fabian Reinartz	5e13880201	General cleanup of rules.	2015-06-06 21:40:52 +02:00
Fabian Reinartz	280d11dca8	main: exit on invalid rule files on startup.	2015-06-02 18:44:41 +02:00
Fabian Reinartz	0de6edbdfc	Move pkg/ to util/	2015-06-01 21:12:32 +02:00
Fabian Reinartz	dbc0d30e3e	Move string functionality to pkg/strutil	2015-06-01 21:12:32 +02:00
Fabian Reinartz	f45a5cab60	Move templates package to pkg/template	2015-06-01 21:12:31 +02:00
Fabian Reinartz	c44ac7bc26	Load rule files from entire directories	2015-06-01 21:12:31 +02:00
Julius Volz	ff53d10849	Fix double slash in GeneratorURL sent to alertmanager. Fixes https://github.com/prometheus/prometheus/issues/722	2015-05-23 19:16:57 +02:00
Julius Volz	267fd34156	Switch Prometheus to use github.com/prometheus/log. This change is conceptually very simple, although the diff is large. It switches logging from "github.com/golang/glog" to "github.com/prometheus/log", while not actually changing any log messages. V(1)-style logging has been changed to be log.Debug*().	2015-05-20 18:19:32 +02:00
Fabian Reinartz	bb540fd9fd	Implement config reloading on SIGHUP. With this commit, sending SIGHUP to the Prometheus process will reload and apply the configuration file. The different components attempt to handle failing changes gracefully.	2015-05-13 16:49:46 +02:00
Fabian Reinartz	fe935179cd	Stop routing rule statements through the engine.	2015-04-29 18:01:43 +02:00
Fabian Reinartz	479891c9be	Rename RuleManager to Manager, remove interface. This commits renames the RuleManager to Manager as the package name is 'rules' now. The unused layer of abstraction of the RuleManager interface is removed.	2015-04-29 16:42:10 +02:00
Fabian Reinartz	3ca11bcaf5	Switch Prometheus to promql package. This commit removes all functionality from rules/ that is now handled in promql/. All parts of Prometheus are changed to use the promql/ package.	2015-04-28 16:19:23 +02:00
Brian Brazil	e041c0cd46	Add console and alert templates with access to all data. Move rulemanager to its own package to break circular dependency. Make NewTestTieredStorage available to tests, remove duplication. Change-Id: I33b321245a44aa727bfc3614a7c9ae5005b34e03	2014-05-30 16:24:56 +01:00
Julius Volz	01f652cb4c	Separate storage implementation from interfaces. This was initially motivated by wanting to distribute the rule checker tool under `tools/rule_checker`. However, this was not possible without also distributing the LevelDB dynamic libraries because the tool transitively depended on Levigo: rule checker -> query layer -> tiered storage layer -> leveldb This change separates external storage interfaces from the implementation (tiered storage, leveldb storage, memory storage) by putting them into separate packages: - storage/metric: public, implementation-agnostic interfaces - storage/metric/tiered: tiered storage implementation, including memory and LevelDB storage. I initially also considered splitting up the implementation into separate packages for tiered storage, memory storage, and LevelDB storage, but these are currently so intertwined that it would be another major project in itself. The query layers and most other parts of Prometheus now have no notion of the storage implementation anymore and just use whatever implementation they get passed in via interfaces. The rule_checker is now a static binary :) Change-Id: I793bbf631a8648ca31790e7e772ecf9c2b92f7a0	2014-04-16 13:30:19 +02:00
Julius Volz	20bfaf80ab	Merge "Display filename when encountering bad rule file."	2013-12-13 15:01:02 +01:00
Julius Volz	3bf3a555b2	Merge "add evalDuration histogram and ruleCount counter for rules"	2013-12-11 22:52:19 +01:00
Stuart Nelson	b75adfebad	add evalDuration histogram and ruleCount counter for rules Change-Id: I3508fe72526348d96b8158828388c3ac8d7c3fa9	2013-12-11 15:42:53 -05:00
Julius Volz	77a79d1fc0	Display filename when encountering bad rule file. Change-Id: I4729371be92c5659a6938145c5fde66771d7be22	2013-12-11 15:44:11 +01:00
Julius Volz	fb44580110	Cleanup/fix program termination sequence. Change-Id: I2bc58a2583fb079c9ef383cfc7a5e0fbe613f1cd	2013-12-11 15:40:32 +01:00
Julius Volz	740d448983	Use custom timestamp type for sample timestamps and related code. So far we've been using Go's native time.Time for anything related to sample timestamps. Since the range of time.Time is much bigger than what we need, this has created two problems: - there could be time.Time values which were out of the range/precision of the time type that we persist to disk, therefore causing incorrectly ordered keys. One bug caused by this was: https://github.com/prometheus/prometheus/issues/367 It would be good to use a timestamp type that's more closely aligned with what the underlying storage supports. - sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit Unix timestamp (possibly even a 32-bit one). Since we store samples in large numbers, this seriously affects memory usage. Furthermore, copying/working with the data will be faster if it's smaller. MEMORY USAGE RESULTS Initial memory usage comparisons for a running Prometheus with 1 timeseries and 100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my tests, this advantage for some reason decreased a bit the more samples the timeseries had (to 5-7% for millions of samples). This I can't fully explain, but perhaps garbage collection issues were involved. WHEN TO USE THE NEW TIMESTAMP TYPE The new clientmodel.Timestamp type should be used whenever time calculations are either directly or indirectly related to sample timestamps. For example: - the timestamp of a sample itself - all kinds of watermarks - anything that may become or is compared to a sample timestamp (like the timestamp passed into Target.Scrape()). When to still use time.Time: - for measuring durations/times not related to sample timestamps, like duration telemetry exporting, timers that indicate how frequently to execute some action, etc. NOTE ON OPERATOR OPTIMIZATION TESTS We don't use operator optimization code anymore, but it still lives in the code as dead code. It still has tests, but I couldn't get all of them to pass with the new timestamp format. I commented out the failing cases for now, but we should probably remove the dead code soon. I just didn't want to do that in the same change as this. Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f	2013-12-03 09:11:28 +01:00
Julius Volz	1eb1ceac8c	Add alert-expression console links to notifications. The ConsoleLinkForExpression() function now escapes console URLs in such a way that works both in emails and in HTML. Change-Id: I917bae0b526cbbac28ccd2a4ec3c5ac03ee4c647	2013-08-20 15:45:41 +02:00
Julius Volz	aa5d251f8d	Use github.com/golang/glog for all logging.	2013-08-12 17:54:36 +02:00
Julius Volz	3b970c5133	Add variable interpolation to notification messages. This includes required refactorings to enable replacing the http client (for testing) and moving the NotificationReq type definitions to the "notifications" package, so that this package doesn't need to depend on "rules" anymore and that it can instead use a representation of the required data which only includes the necessary fields.	2013-08-12 12:29:08 +02:00
Julius Volz	35ee2cd3cb	Add alertmanager notification support to Prometheus. Alert definitions now also have mandatory SUMMARY and DESCRIPTION fields that get sent along a firing alert to the alert manager.	2013-07-30 17:23:41 +02:00
Matt T. Proud	30b1cf80b5	WIP - Snapshot of Moving to Client Model.	2013-06-25 15:52:42 +02:00
Julius Volz	0226d1ac7a	Implement alerts dashboard and expression console links.	2013-06-13 22:35:40 +02:00
Julius Volz	ba29d07901	Show loaded rules in Status dashboard.	2013-06-11 11:39:31 +02:00
Julius Volz	adb87816f4	Put RuleManager concurrency in hands of caller, fix races.	2013-06-05 13:56:56 +02:00
Matt T. Proud	c10780c966	Introduce telemetry for rule evaluator durations. This commit adds telemetry for the Prometheus expression rule evaluator, which will enable meta-Prometheus monitoring of customers to ensure that no instance is falling behind in answering routine queries. A few other sundry simplifications are introduced, too.	2013-05-23 21:29:27 +02:00
Julius Volz	56324d8ce2	Make AST query storage non-global.	2013-05-07 13:15:10 +02:00
Julius Volz	9cea5d9df8	Convert the Prometheus configuration to protocol buffers.	2013-04-30 22:26:00 +02:00
Julius Volz	d8110fcd9c	Send sample arrays instead of single samples over channels.	2013-04-29 17:24:17 +02:00
Julius Volz	2202cd71c9	Track alerts over time and write out alert timeseries.	2013-04-26 14:35:21 +02:00
Julius Volz	c0601abf46	Implement initial no-op alert parsing and rule parsing tests.	2013-04-23 13:48:24 +02:00
Julius Volz	1eb586db7d	Fix rule evaluation closure.	2013-04-17 15:11:21 +02:00
Julius Volz	c4d0969c00	Propagate more errors during rule evaluation.	2013-04-09 13:47:20 +02:00


253 commits