Commit graph

249 commits

Author SHA1 Message Date
Simon Pasquier 45506841e6
*: enable all default linters (#5504)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-05-03 15:11:28 +02:00
Bjoern Rabenstein 38d518c0fe Rework #5009 after comments
Signed-off-by: Bjoern Rabenstein <bjoern@rabenste.in>
2019-04-17 01:40:10 +02:00
Tariq Ibrahim 8fdfa8abea refine error handling in prometheus (#5388)
i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors.
ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives.
iii) Does away with the use of fmt package for errors in favour of pkg/errors

Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-03-26 00:01:12 +01:00
James Ravn e15d8c5802 reload: copy state on both name and labels (#5368)
* reload: copy state on both name and labels

Fix https://github.com/prometheus/prometheus/issues/5193

Using just name causes the linked issue - if new rules are inserted with
the same name (but different labels), the reordering will cause stale
markers to be inserted in the next eval for all shifted rules, despite
them not being stale.

Ideally we want to avoid stale markers for time series that still exist
in the new rules, with name and labels being the unique identifer.

This change adds labels to the internal map when copying the old rule
data to the new rule data. This prevents the problem of staling rules
that simply shifted order.

If labels change, it is a new time series and the old series will stale
regardless. So it should be safe to always match on name and labels when
copying state.

Signed-off-by: James Ravn <james@r-vn.org>
2019-03-15 15:23:36 +00:00
David Symonds 46361a7c85 rules: Fix sorting of result from (*Manager).RuleGroups (#5260)
The previous code was defective in that it never sorted groups within a
file due to doing a multi-key sort incorrectly.

Signed-off-by: David Symonds <dsymonds@gmail.com>
2019-02-23 09:51:44 +01:00
beorn7 2db1eeb4ec Fix prometheus_rule_group_last_evaluation_timestamp_seconds
It should be a unix timestamp, not the seconds in the minute.

Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-06 11:02:49 +01:00
Ganesh Vernekar 787eb1e904 Set rule_group_last_duration_seconds to seconds (#5153)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-01-31 11:07:58 +01:00
Vishnunarayan K I fd3ef6ba34 Add metric rule_group_rules_loaded to get the number of rules loaded (#5090)
Signed-off-by: Vishnunarayan K I <appukuttancr@gmail.com>
2019-01-13 14:28:07 +00:00
Tom Wilkie 121603c417
Expose rules.NewGroupMetrics and rules.Metrics. (#5059)
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-01-03 12:07:06 +00:00
Bartek Płotka de213d4a5e rule manager: Moved metric registration to custom registerer which is already available. (#4961)
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2018-12-28 10:20:29 +00:00
mknapphrt f0e9196dca Return warnings on a remote read fail (#4832)
Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
2018-11-30 14:27:12 +00:00
Krasi Georgiev 0754e5334b
querier for RestoreForState not closed. (#4922)
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
2018-11-28 15:25:17 +02:00
Wei Guo e329cbf673 Add metric prometheus_rule_group_last_evaluation for recording and alerting (#4852)
* add metric prometheus_rule_group_last_evaluation for recording and alerting

Signed-off-by: Wei Guo <me@imkira.com>

* fix issues from comments

Signed-off-by: Wei Guo <me@imkira.com>
2018-11-27 14:38:13 +08:00
Will Hegedus 193ebe7e34 Updates to /targets and /rules (scrape duration, last evaluation time) (#4722)
* Add evaluationTimestamp (Last Evaluation) column to display on /rules
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Add lastScrapeDuration ("Scrape Duration") to display on /targets
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Updates based on Julius' feedback

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Update to set timestamp to when eval started (after eval completes)

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Update /rules to display time since last evaluation

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>

* Re-order Last Eval/Eval Time to be consistent with targets page

Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
2018-10-12 18:26:59 +02:00
Ganesh Vernekar 5790d23fd8 Unit testing for rules (#4350)
* Unit testing for rules
* Specifying order of group evaluation in unit tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-09-25 17:06:26 +01:00
Chris Marchbanks 63ed9d1b70 Send EndsAt along with alerts (#4550)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2018-08-28 16:05:00 +01:00
Chris Marchbanks 87f1dad16d throttle resends of alerts to 1 minute by default (#4538)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2018-08-27 17:41:42 +01:00
Julius Volz 8fbe1b5133
Handle a bunch of unchecked errors (#4461)
There are many more (mostly finalizers like Close/Stop/etc.), but most of
the others seemed like one couldn't do much about them anyway.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-17 17:24:35 +02:00
Julien Pivotto 0b4d22b245 rules/manager: remove a no-longer-relevant comment (#4503)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2018-08-15 09:33:39 +01:00
Benji Visser 8bb6e0dd6e Show rule evaluation errors on rules page (#4457)
* adding information about the health and errors for Rules

adding Health() and LastError() to the Rule interface. This will allow
us to easily surface information about rules.

Signed-off-by: noqcks <benny@noqcks.io>

* updating rules.html with fields for Rule errors and health state

Signed-off-by: noqcks <benny@noqcks.io>

* fix code comment grammar & access Rule health/error info using a mutex

Signed-off-by: noqcks <benny@noqcks.io>

* s/Errors/Error/ in rules.html to remain consistent with targets.html

Signed-off-by: noqcks <benny@noqcks.io>

* adding periods to code comments in reporting/alerting

Signed-off-by: noqcks <benny@noqcks.io>

* putting health/error below mutex in struct field

Signed-off-by: noqcks <benny@noqcks.io>
2018-08-07 00:33:45 +02:00
Julius Volz 90521a65f8
Remove error return value from NotifyFunc() (#4459)
It's always nil and we also forgot to check it.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-04 21:31:12 +02:00
Ganesh Vernekar f1db699dff Persist alert 'for' state across restarts (#4061)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-08-02 11:18:24 +01:00
Max Leonard Inden 71fafad099
api/v1: Coninue work exposing rules and alerts
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-07-30 15:31:51 +02:00
Bryan Boreham afdb66dfac Expose Group.CopyState() (#4304)
This makes the `rules` package more useful to projects that use
Prometheus as a library.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2018-07-18 15:14:38 +02:00
Julius Volz 9e3171f6e3 rules: Minor naming/comment cleanups (#4328)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-07-18 04:54:33 +01:00
Alin Sinpalean 9dc763cc03 Run rule evaluation with timestamps precisely evaluation_interval apart (#4201)
* Run rule evaluation with timestamps precisely evaluation_interval apart from one another.

Signed-off-by: Alin Sinpalean <alin.sinpalean@gmail.com>
2018-06-01 15:23:07 +01:00
Mario Trangoni 464e747f1e fix some comments typos (#4059) 2018-04-08 10:51:54 +01:00
Bryan Boreham 93494d8b7e Add an OpenTracing span for each rule (#4027)
* Add an OpenTracing span for each rule

So that tags and child spans can be traced back to the rule that they
refer to.
2018-03-30 21:29:19 +01:00
ferhat elmas ffa673f7d8 General simplifications (#3887)
Another try as in #1516
2018-02-26 07:58:10 +00:00
Fabian Reinartz 7ccd4b39b8 *: implement query params
This adds a parameter to the storage selection interface which allows
query engine(s) to pass information about the operations surrounding a
data selection.
This can for example be used by remote storage backends to infer the
correct downsampling aggregates that need to be provided.
2018-02-13 12:17:22 +01:00
Brian Brazil 30b4439bbd Remove rule_type label from rule metrics.
This is not really needed now that we have rule groups
to distinguish rules.
2017-12-04 11:44:38 +00:00
Brian Brazil b97f4cf48c Add metrics for rule group interval and last duration. 2017-12-04 11:44:38 +00:00
Brian Brazil 0a42a9fc8f Copy over rule group duration on reload.
This is currently getting lost, this will soon be in a metric and we
don't want it dropping to 0 on every reload.
2017-12-04 11:44:38 +00:00
Brian Brazil aa370fa568 Clarify metric names around rule groups.
Make it clear they're about overall rule groups.
2017-12-04 11:44:38 +00:00
Fabian Reinartz 62461379b7 rules: decouple notifier packages
The dependency on the notifier packages caused a transitive dependency
on discovery and with that all client libraries our service discovery
uses.
2017-11-27 16:38:14 +01:00
Fabian Reinartz 4d964a0a0d rules: make glob expansion a concern of main 2017-11-24 08:22:57 +01:00
Fabian Reinartz bd9f7460eb rules: remove config package dependency 2017-11-24 07:57:54 +01:00
Fabian Reinartz 2d0e3746ac rules: remove dependency on promql.Engine 2017-11-24 07:57:54 +01:00
Goutham Veeramachaneni a880c86375
Fix unexported method on exported interface.
Also move to model.Duration

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-11-23 19:13:57 +05:30
conorbroderick 55aaece116 Add rule evaluation time 2017-11-22 15:22:02 +00:00
Goutham Veeramachaneni e1117715fe
rules: remove skipped iterations cuz no throttling
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-11-14 17:33:00 +05:30
Krasi Georgiev e86d82ad2d Fix regression of alert rules state loss on config reload. (#3382)
* incorrect map name for the group prevented copying state from existing alert rules on config reload

* applyConfig test

* few nits

* nits 2
2017-11-01 12:58:00 +01:00
Julius Volz 099df0c5f0 Migrate "golang.org/x/net/context" -> "context" (#3333)
In some places, where ctxhttp or gRPC are concerned, we still need to use the
old contexts.
2017-10-24 21:21:42 -07:00
Brian Brazil ee88f0d222 Ensure all values are used or _ 2017-10-09 19:44:03 +01:00
Fabian Reinartz 2d0b8e8b94 Merge branch 'master' into dev-2.0 2017-10-05 13:09:18 +02:00
beorn7 c2e9a151ab Make all rule links link to the "Console" tab rather than "Graph"
Clicking on a rule, either the name or the expression, opens the rule
result (or the corresponding expression, repsectively) in the
expression browser. This should by default happen in the console tab,
as, more often than not, displaying it in the graph tab runs into a
timeout.
2017-09-21 18:28:00 +02:00
Fabian Reinartz d21f149745 *: migrate to go-kit/log 2017-09-08 22:01:51 +05:30
Goutham Veeramachaneni 37e7b69f56
Merge remote-tracking branch 'upstream/dev-2.0' into rulegroups
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-19 16:34:55 +05:30
Goutham Veeramachaneni c472316fb3
Check done before every rule evaluation.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 16:57:22 +05:30
Goutham Veeramachaneni 6b70a4d850
Incorporate PR feedback
* Move fingerprint to Hash()
* Move away from tsdb.MultiError
* 0777 -> 0666 for files
* checkOverflow of extra fields

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 16:44:33 +05:30
Goutham Veeramachaneni 507790a357
Rework logging to use explicitly passed logger
Mostly cleaned up the global logger use. Still some uses in discovery
package.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 15:52:44 +05:30
Goutham Veeramachaneni dc69645e92
Move back to go-yaml
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 10:46:21 +05:30
Goutham Veeramachaneni 5ff283a7b7
Reflect the grouping in the UI
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 16:09:14 +05:30
Goutham Veeramachaneni 8cca666cf2
Add file name to group.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 15:18:39 +05:30
Goutham Veeramachaneni e893c89333
Validate labels and annotations
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 15:07:54 +05:30
Goutham Veeramachaneni a48a018368
Make sure groups are unique in a single file
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 12:19:21 +05:30
Goutham Veeramachaneni cea1e99f78
Add update-rules command to promtool
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-14 11:38:54 +05:30
Goutham Veeramachaneni e8f55669ea Move rules to new format
Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
2017-06-12 18:14:39 +05:30
Brian Brazil cc867dae60 Copy previous series and alert state more intelligently.
Usually rules don't more around, and if they do it's likely
that rules/alerts with the same name stay in the same order.

If rules/alerts with the same name are added/removed this
could cause a blip for one cycle, but this is unavoidable
without requiring rule and alert names to be unique - which we don't
want to do.
2017-05-24 13:52:45 +01:00
Brian Brazil 9bc68db7e6 Track staleness per rule rather than per group. 2017-05-24 13:52:45 +01:00
Brian Brazil 0451d6d31b Add unittest for rule staleness, and rules generally. 2017-05-24 13:52:45 +01:00
Brian Brazil 0400f3cfd2 Very basic staleness handling for rules. 2017-05-24 13:52:45 +01:00
Fabian Reinartz 06c2b76cd4 Merge branch 'master' into uptsdb 2017-05-16 16:48:37 +02:00
Julius Volz ac203ef0ee Add externalURL template function (#2716)
This allows users to e.g. add links back to the generating Prometheus
right in their alert templates.
2017-05-13 15:47:04 +02:00
Fabian Reinartz 8ffc851147 Merge branch 'master' into dev-2.0 2017-04-04 15:17:56 +02:00
Tobias Schmidt eaf33759fb Register forgotten prometheus_evaluator_iterations_total metric 2017-04-02 20:32:56 -03:00
Tobias Schmidt aaaba57184 Export number of missed rule evaluations
In case the execution of all rules takes longer than the configured rule
evaluation interval, one or more iterations will be skipped. This needs
to be visible to the opterator.
2017-04-02 20:03:28 -03:00
Fabian Reinartz 5772f1a7ba retrieval/storage: adapt to new interface
This simplifies the interface to two add methods for
appends with labels or faster reference numbers.
2017-02-02 13:05:46 +01:00
Fabian Reinartz ad9bc62e4c storage: extend appender and adapt it 2017-01-13 14:48:01 +01:00
Fabian Reinartz e94b0899ee rules: fix tests, remove model types 2016-12-29 17:31:14 +01:00
Fabian Reinartz f8fc1f5bb2 *: migrate ingestion to new batch Appender 2016-12-29 11:03:56 +01:00
Fabian Reinartz 5817cb5bde *: migrate from model.* to promql.* types 2016-12-25 00:37:46 +01:00
Jonathan Lange d78dd3593d Set evaluation interval on Group construction
Prevents having object in invalid state, and allows users of public API
to construct valid Groups.
2016-11-18 16:32:30 +00:00
Jonathan Lange 31fc357cd8 Make NewGroup and Group.Eval public
Allows callers to execute evaluate lists of rules without first writing
them to disk.
2016-11-18 16:25:58 +00:00
Jonathan Lange 2a2da40223 Make rule evaluation publicly available
Means that a third-party can parse rules and run them with their own
execution model.
2016-11-18 16:12:50 +00:00
Matt Bostock 926a5ab3dd rules/manager.go: Fix race between reload and stop
On one relatively large Prometheus instance (1.7M series), I noticed
that upgrades were frequently resulting in Prometheus undergoing crash
recovery on start-up.

On closer examination, I found that Prometheus was panicking on
shutdown.

It seems that our configuration management (or misconfiguration thereof)
is reloading Prometheus then immediately restarting it, which I suspect
is causing this race:

    Sep 21 15:12:42 host systemd[1]: Reloading prometheus monitoring system.
    Sep 21 15:12:42 host prometheus[18734]: time="2016-09-21T15:12:42Z" level=info msg="Loading configuration file /etc/prometheus/config.yaml" source="main.go:221"
    Sep 21 15:12:42 host systemd[1]: Reloaded prometheus monitoring system.
    Sep 21 15:12:44 host systemd[1]: Stopping prometheus monitoring system...
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:203"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="See you next time!" source="main.go:210"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="Stopping target manager..." source="targetmanager.go:90"
    Sep 21 15:12:52 host prometheus[18734]: time="2016-09-21T15:12:52Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:548"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=1 source="scrape.go:467"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping rule manager..." source="manager.go:366"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Rule manager stopped." source="manager.go:372"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping notification handler..." source="notifier.go:325"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping local storage..." source="storage.go:381"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping maintenance loop..." source="storage.go:383"
    Sep 21 15:13:01 host prometheus[18734]: panic: close of closed channel
    Sep 21 15:13:01 host prometheus[18734]: goroutine 7686074 [running]:
    Sep 21 15:13:01 host prometheus[18734]: panic(0xba57a0, 0xc60c42b500)
    Sep 21 15:13:01 host prometheus[18734]: /usr/local/go/src/runtime/panic.go:500 +0x1a1
    Sep 21 15:13:01 host prometheus[18734]: github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1(0xc6645a9901, 0xc420271ef0, 0xc420338ed0, 0xc60c42b4f0, 0xc6645a9900)
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:412 +0x3c
    Sep 21 15:13:01 host prometheus[18734]: created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:423 +0x56b
    Sep 21 15:13:03 host systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
2016-09-21 22:03:02 +01:00
Julius Volz c187308366 storage: Contextify storage interfaces.
This is based on https://github.com/prometheus/prometheus/pull/1997.

This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
2016-09-19 16:29:07 +02:00
Julius Volz ed5a0f0abe promql: Allow per-query contexts.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.

This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.

Thus, we want to be able to pass per-query contexts into a single engine.

This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).

In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
2016-09-19 15:38:17 +02:00
Dmitry Vorobev 273e457da4 web: return status code and error message for config resource 2016-07-15 10:15:24 +02:00
Brian Brazil 0509b0f2db Expand alert templates at eval time.
Fixes #1678 #1677
2016-07-12 17:13:55 +01:00
beorn7 064b57858e Consistently use the Seconds() method for conversion of durations
This also fixes one remaining case of recording integral numbers
of seconds only for a metric, i.e. this will probably fix #1796.
2016-07-07 15:24:35 +02:00
beorn7 45e5775f9b Add missing logging of out-of-order samples
So far, out-of-order samples during rule evaluation were not logged,
and neither scrape health samples. The latter are unlikely to cause
any errors. That's why I'm logging them always now. (It's alway highly
irregular should it happen.) For rules, I have used the same plumbing
as for samples, just with a different wording in the message to mark
them as a result of rule evaluation.
2016-05-19 16:22:53 +02:00
Fabian Reinartz d89c254849 Make copying alerting state safer.
This considers static labels in the equality of alerts to
avoid falsely copying state from a different alert definition with
the same name across reloads.

To be safe, it also copies the state map rather than just its pointer
so that remaining collisions disappear after one evaluation interval.
2016-03-02 12:21:54 +01:00
Fabian Reinartz bfa8aaa017 Rename notification to notifier 2016-03-01 12:39:08 +01:00
beorn7 663a1550d0 Fix the instrumentation fixes 2016-02-17 15:50:55 +01:00
beorn7 ec08c9a391 Rework the way to communicate backpressure (AKA suspended ingestion)
This gives up on the idea to communicate throuh the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.

Logging of throttled mode has been streamlined and will create at most
one message per minute.
2016-02-01 14:45:44 +01:00
Fabian Reinartz b0adfea8d5 Fix swapped constants, improve instrumentation 2016-01-21 12:15:29 +01:00
Fabian Reinartz a8c38c3ac5 Don't log rule evaluation failure on shutdown 2016-01-18 17:34:25 +01:00
Fabian Reinartz 6eee86dce8 Terminate rule groups during initial sleep
When an evaluation group runs initially, it waits a deterministic
amount of time. During that time it also has to accept
a termination singnal so shutdown doesn't hang during the first
evaluation iteration after a configuration reload.

Fixes #1307
2016-01-12 10:54:09 +01:00
Fabian Reinartz 37d80c4b25 Fix premature rule evaluation
This commit prevents rule evaluation from starting until after
the storage is ready.
2016-01-08 17:51:22 +01:00
Fabian Reinartz 0cf3c6a9ef Add comments, rename a method 2015-12-23 12:29:28 +01:00
Fabian Reinartz bf6abac8f4 Send resolved notifications 2015-12-17 15:42:26 +01:00
Fabian Reinartz f69e668fc4 Improve rules/ instrumentation
This commit adds a counter for the total number of rule evaluations
and standardizes the units to seconds.
2015-12-17 15:42:26 +01:00
Fabian Reinartz 52e5224f5a Refactor rules/ package 2015-12-17 15:42:25 +01:00
Fabian Reinartz e4fabe135a Set StartsAt to time of first firing state 2015-12-17 11:36:58 +01:00
Fabian Reinartz 7c90db22ed Use annotation based alerts in rules/
This commit breaks the previously used alert format.
2015-12-14 10:16:07 +01:00
Fabian Reinartz e114ce0ff7 Refactor notification handler 2015-12-11 15:17:32 +01:00
Fabian Reinartz e3b6ec9784 Switch to common/log 2015-10-03 10:21:43 +02:00
Julius Volz 995d3b831d Fix most golint warnings.
This is with `golint -min_confidence=0.5`.

I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
2015-08-26 12:44:46 +02:00
Fabian Reinartz d6b8da8d43 Switch promql types to common/model 2015-08-25 13:49:14 +02:00
Brian Brazil fdf0d0642e Cast value to float, as that's what the console templates expect. 2015-08-24 16:59:08 +01:00
Fabian Reinartz 438e232c9b Fix grouping of import blocks 2015-08-22 09:42:45 +02:00
Fabian Reinartz 306e8468a0 Switch from client_golang/model to common/model 2015-08-21 13:33:38 +02:00
Fabian Reinartz 7a67472fc1 Resolve relative paths on configuration loading
This moves the concern of resolving the files relative to the config
file into the configuration loading itself.
It also fixes #921 which did not load the cert and token files relatively.
2015-08-05 18:08:04 +02:00
Fabian Reinartz feb8a03503 rules: load rule files relative to a base dir 2015-07-03 15:10:37 +02:00
Julius Volz fcff35b43e Consolidate external reachability flags into one.
Besides fixing https://github.com/prometheus/prometheus/issues/805 by
making the entire externally reachable server URL configurable, this
adds tests for the "globalURL" template function and makes it easier to
test other such functions in the future.

This breaks the `web.Hostname` flag (and introduces `web.external-url`).
This flag is likely only used by few users, so I hope that's
justifiable.

Fixes https://github.com/prometheus/prometheus/issues/805
2015-07-03 13:39:10 +02:00
Fabian Reinartz f06cf664e1 rules: cleanup alerting test 2015-06-30 18:22:24 +02:00
Fabian Reinartz 9bd4f6d017 rules: preserve alert state across reloads. 2015-06-30 11:32:07 +02:00
Fabian Reinartz 4625485b84 rules: move rules*.go contents to manager*.go 2015-06-30 11:32:07 +02:00
Fabian Reinartz 749ae450c5 promql: add runbook to alert statement.
This commit adds the RUNBOOK keyword to alert statements. The field
is optional and expected to be a link.
2015-06-25 13:00:52 +02:00
Fabian Reinartz 5e13880201 General cleanup of rules. 2015-06-06 21:40:52 +02:00
Fabian Reinartz 280d11dca8 main: exit on invalid rule files on startup. 2015-06-02 18:44:41 +02:00
Fabian Reinartz 0de6edbdfc Move pkg/ to util/ 2015-06-01 21:12:32 +02:00
Fabian Reinartz dbc0d30e3e Move string functionality to pkg/strutil 2015-06-01 21:12:32 +02:00
Fabian Reinartz f45a5cab60 Move templates package to pkg/template 2015-06-01 21:12:31 +02:00
Fabian Reinartz c44ac7bc26 Load rule files from entire directories 2015-06-01 21:12:31 +02:00
Julius Volz ff53d10849 Fix double slash in GeneratorURL sent to alertmanager.
Fixes https://github.com/prometheus/prometheus/issues/722
2015-05-23 19:16:57 +02:00
Julius Volz 267fd34156 Switch Prometheus to use github.com/prometheus/log.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
2015-05-20 18:19:32 +02:00
Fabian Reinartz bb540fd9fd Implement config reloading on SIGHUP.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
2015-05-13 16:49:46 +02:00
Fabian Reinartz fe935179cd Stop routing rule statements through the engine. 2015-04-29 18:01:43 +02:00
Fabian Reinartz 479891c9be Rename RuleManager to Manager, remove interface.
This commits renames the RuleManager to Manager as the package
name is 'rules' now. The unused layer of abstraction of the
RuleManager interface is removed.
2015-04-29 16:42:10 +02:00
Fabian Reinartz 3ca11bcaf5 Switch Prometheus to promql package.
This commit removes all functionality from rules/ that is now handled in
promql/.
All parts of Prometheus are changed to use the promql/ package.
2015-04-28 16:19:23 +02:00
Brian Brazil e041c0cd46 Add console and alert templates with access to all data.
Move rulemanager to it's own package to break cicrular dependency.
Make NewTestTieredStorage available to tests, remove duplication.

Change-Id: I33b321245a44aa727bfc3614a7c9ae5005b34e03
2014-05-30 16:24:56 +01:00
Julius Volz 01f652cb4c Separate storage implementation from interfaces.
This was initially motivated by wanting to distribute the rule checker
tool under `tools/rule_checker`. However, this was not possible without
also distributing the LevelDB dynamic libraries because the tool
transitively depended on Levigo:

rule checker -> query layer -> tiered storage layer -> leveldb

This change separates external storage interfaces from the
implementation (tiered storage, leveldb storage, memory storage) by
putting them into separate packages:

- storage/metric: public, implementation-agnostic interfaces
- storage/metric/tiered: tiered storage implementation, including memory
                         and LevelDB storage.

I initially also considered splitting up the implementation into
separate packages for tiered storage, memory storage, and LevelDB
storage, but these are currently so intertwined that it would be another
major project in itself.

The query layers and most other parts of Prometheus now have notion of
the storage implementation anymore and just use whatever implementation
they get passed in via interfaces.

The rule_checker is now a static binary :)

Change-Id: I793bbf631a8648ca31790e7e772ecf9c2b92f7a0
2014-04-16 13:30:19 +02:00
Julius Volz 20bfaf80ab Merge "Display filename when encountering bad rule file." 2013-12-13 15:01:02 +01:00
Julius Volz 3bf3a555b2 Merge "add evalDuration histogram and ruleCount counter for rules" 2013-12-11 22:52:19 +01:00
Stuart Nelson b75adfebad add evalDuration histogram and ruleCount counter for rules
Change-Id: I3508fe72526348d96b8158828388c3ac8d7c3fa9
2013-12-11 15:42:53 -05:00
Julius Volz 77a79d1fc0 Display filename when encountering bad rule file.
Change-Id: I4729371be92c5659a6938145c5fde66771d7be22
2013-12-11 15:44:11 +01:00
Julius Volz fb44580110 Cleanup/fix program termination sequence.
Change-Id: I2bc58a2583fb079c9ef383cfc7a5e0fbe613f1cd
2013-12-11 15:40:32 +01:00
Julius Volz 740d448983 Use custom timestamp type for sample timestamps and related code.
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:

- there could be time.Time values which were out of the range/precision of the
  time type that we persist to disk, therefore causing incorrectly ordered keys.
  One bug caused by this was:

  https://github.com/prometheus/prometheus/issues/367

  It would be good to use a timestamp type that's more closely aligned with
  what the underlying storage supports.

- sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit
  Unix timestamp (possibly even a 32-bit one). Since we store samples in large
  numbers, this seriously affects memory usage. Furthermore, copying/working
  with the data will be faster if it's smaller.

*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.

*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.

For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
  passed into Target.Scrape()).

When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
  telemetry exporting, timers that indicate how frequently to execute some
  action, etc.

*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.

Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
2013-12-03 09:11:28 +01:00
Julius Volz 1eb1ceac8c Add alert-expression console links to notifications.
The ConsoleLinkForExpression() function now escapes console URLs in such a way
that works both in emails and in HTML.

Change-Id: I917bae0b526cbbac28ccd2a4ec3c5ac03ee4c647
2013-08-20 15:45:41 +02:00
Julius Volz aa5d251f8d Use github.com/golang/glog for all logging. 2013-08-12 17:54:36 +02:00
Julius Volz 3b970c5133 Add variable interpolation to notification messages.
This includes required refactorings to enable replacing the http client (for
testing) and moving the NotificationReq type definitions to the "notifications"
package, so that this package doesn't need to depend on "rules" anymore and
that it can instead use a representation of the required data which only
includes the necessary fields.
2013-08-12 12:29:08 +02:00
Julius Volz 35ee2cd3cb Add alertmanager notification support to Prometheus.
Alert definitions now also have mandatory SUMMARY and DESCRIPTION fields
that get sent along a firing alert to the alert manager.
2013-07-30 17:23:41 +02:00
Matt T. Proud 30b1cf80b5 WIP - Snapshot of Moving to Client Model. 2013-06-25 15:52:42 +02:00
Julius Volz 0226d1ac7a Implement alerts dashboard and expression console links. 2013-06-13 22:35:40 +02:00
Julius Volz ba29d07901 Show loaded rules in Status dashboard. 2013-06-11 11:39:31 +02:00
Julius Volz adb87816f4 Put RuleManager concurrency in hands of caller, fix races. 2013-06-05 13:56:56 +02:00
Matt T. Proud c10780c966 Introduce telemetry for rule evaluator durations.
This commit adds telemetry for the Prometheus expression rule
evaluator, which will enable meta-Prometheus monitoring of customers
to ensure that no instance is falling behind in answering routine
queries.

A few other sundry simplifications are introduced, too.
2013-05-23 21:29:27 +02:00
Julius Volz 56324d8ce2 Make AST query storage non-global. 2013-05-07 13:15:10 +02:00
Julius Volz 9cea5d9df8 Convert the Prometheus configuration to protocol buffers. 2013-04-30 22:26:00 +02:00
Julius Volz d8110fcd9c Send sample arrays instead of single samples over channels. 2013-04-29 17:24:17 +02:00
Julius Volz 2202cd71c9 Track alerts over time and write out alert timeseries. 2013-04-26 14:35:21 +02:00
Julius Volz c0601abf46 Implement initial no-op alert parsing and rule parsing tests. 2013-04-23 13:48:24 +02:00
Julius Volz 1eb586db7d Fix rule evaluation closure. 2013-04-17 15:11:21 +02:00
Julius Volz c4d0969c00 Propagate more errors during rule evaluation. 2013-04-09 13:47:20 +02:00
Matt T. Proud ea54751431 Update import paths to new location.
This repository moved from matttproud/prometheus to
prometheus/prometheus, and all import paths need to be updated.
2013-01-27 18:49:45 +01:00
Julius Volz 483bd81a44 Allow grammar to parse both rules and single expressions. 2013-01-11 01:17:37 +01:00
Julius Volz 56384bf42a Add initial config and rule language implementation. 2013-01-07 23:43:36 +01:00