Commit graph

225 commits

Author SHA1 Message Date
Jonathan Lange 2a2da40223 Make rule evaluation publicly available
Means that a third-party can parse rules and run them with their own
execution model.
2016-11-18 16:12:50 +00:00
Matt Bostock 926a5ab3dd rules/manager.go: Fix race between reload and stop
On one relatively large Prometheus instance (1.7M series), I noticed
that upgrades were frequently resulting in Prometheus undergoing crash
recovery on start-up.

On closer examination, I found that Prometheus was panicking on
shutdown.

It seems that our configuration management (or misconfiguration thereof)
is reloading Prometheus then immediately restarting it, which I suspect
is causing this race:

    Sep 21 15:12:42 host systemd[1]: Reloading prometheus monitoring system.
    Sep 21 15:12:42 host prometheus[18734]: time="2016-09-21T15:12:42Z" level=info msg="Loading configuration file /etc/prometheus/config.yaml" source="main.go:221"
    Sep 21 15:12:42 host systemd[1]: Reloaded prometheus monitoring system.
    Sep 21 15:12:44 host systemd[1]: Stopping prometheus monitoring system...
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:203"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="See you next time!" source="main.go:210"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="Stopping target manager..." source="targetmanager.go:90"
    Sep 21 15:12:52 host prometheus[18734]: time="2016-09-21T15:12:52Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:548"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=1 source="scrape.go:467"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping rule manager..." source="manager.go:366"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Rule manager stopped." source="manager.go:372"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping notification handler..." source="notifier.go:325"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping local storage..." source="storage.go:381"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping maintenance loop..." source="storage.go:383"
    Sep 21 15:13:01 host prometheus[18734]: panic: close of closed channel
    Sep 21 15:13:01 host prometheus[18734]: goroutine 7686074 [running]:
    Sep 21 15:13:01 host prometheus[18734]: panic(0xba57a0, 0xc60c42b500)
    Sep 21 15:13:01 host prometheus[18734]: /usr/local/go/src/runtime/panic.go:500 +0x1a1
    Sep 21 15:13:01 host prometheus[18734]: github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1(0xc6645a9901, 0xc420271ef0, 0xc420338ed0, 0xc60c42b4f0, 0xc6645a9900)
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:412 +0x3c
    Sep 21 15:13:01 host prometheus[18734]: created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:423 +0x56b
    Sep 21 15:13:03 host systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
2016-09-21 22:03:02 +01:00
Julius Volz c187308366 storage: Contextify storage interfaces.
This is based on https://github.com/prometheus/prometheus/pull/1997.

This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
2016-09-19 16:29:07 +02:00
Julius Volz ed5a0f0abe promql: Allow per-query contexts.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.

This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.

Thus, we want to be able to pass per-query contexts into a single engine.

This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).

In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
2016-09-19 15:38:17 +02:00
Dmitry Vorobev 273e457da4 web: return status code and error message for config resource 2016-07-15 10:15:24 +02:00
Brian Brazil 0509b0f2db Expand alert templates at eval time.
Fixes #1678 #1677
2016-07-12 17:13:55 +01:00
beorn7 064b57858e Consistently use the Seconds() method for conversion of durations
This also fixes one remaining case of recording integral numbers
of seconds only for a metric, i.e. this will probably fix #1796.
2016-07-07 15:24:35 +02:00
beorn7 45e5775f9b Add missing logging of out-of-order samples
So far, out-of-order samples during rule evaluation were not logged,
and neither scrape health samples. The latter are unlikely to cause
any errors. That's why I'm logging them always now. (It's alway highly
irregular should it happen.) For rules, I have used the same plumbing
as for samples, just with a different wording in the message to mark
them as a result of rule evaluation.
2016-05-19 16:22:53 +02:00
Fabian Reinartz d89c254849 Make copying alerting state safer.
This considers static labels in the equality of alerts to
avoid falsely copying state from a different alert definition with
the same name across reloads.

To be safe, it also copies the state map rather than just its pointer
so that remaining collisions disappear after one evaluation interval.
2016-03-02 12:21:54 +01:00
Fabian Reinartz bfa8aaa017 Rename notification to notifier 2016-03-01 12:39:08 +01:00
beorn7 663a1550d0 Fix the instrumentation fixes 2016-02-17 15:50:55 +01:00
beorn7 ec08c9a391 Rework the way to communicate backpressure (AKA suspended ingestion)
This gives up on the idea to communicate throuh the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.

Logging of throttled mode has been streamlined and will create at most
one message per minute.
2016-02-01 14:45:44 +01:00
Fabian Reinartz b0adfea8d5 Fix swapped constants, improve instrumentation 2016-01-21 12:15:29 +01:00
Fabian Reinartz a8c38c3ac5 Don't log rule evaluation failure on shutdown 2016-01-18 17:34:25 +01:00
Fabian Reinartz 6eee86dce8 Terminate rule groups during initial sleep
When an evaluation group runs initially, it waits a deterministic
amount of time. During that time it also has to accept
a termination singnal so shutdown doesn't hang during the first
evaluation iteration after a configuration reload.

Fixes #1307
2016-01-12 10:54:09 +01:00
Fabian Reinartz 37d80c4b25 Fix premature rule evaluation
This commit prevents rule evaluation from starting until after
the storage is ready.
2016-01-08 17:51:22 +01:00
Fabian Reinartz 0cf3c6a9ef Add comments, rename a method 2015-12-23 12:29:28 +01:00
Fabian Reinartz bf6abac8f4 Send resolved notifications 2015-12-17 15:42:26 +01:00
Fabian Reinartz f69e668fc4 Improve rules/ instrumentation
This commit adds a counter for the total number of rule evaluations
and standardizes the units to seconds.
2015-12-17 15:42:26 +01:00
Fabian Reinartz 52e5224f5a Refactor rules/ package 2015-12-17 15:42:25 +01:00
Fabian Reinartz e4fabe135a Set StartsAt to time of first firing state 2015-12-17 11:36:58 +01:00
Fabian Reinartz 7c90db22ed Use annotation based alerts in rules/
This commit breaks the previously used alert format.
2015-12-14 10:16:07 +01:00
Fabian Reinartz e114ce0ff7 Refactor notification handler 2015-12-11 15:17:32 +01:00
Fabian Reinartz e3b6ec9784 Switch to common/log 2015-10-03 10:21:43 +02:00
Julius Volz 995d3b831d Fix most golint warnings.
This is with `golint -min_confidence=0.5`.

I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
2015-08-26 12:44:46 +02:00
Fabian Reinartz d6b8da8d43 Switch promql types to common/model 2015-08-25 13:49:14 +02:00
Brian Brazil fdf0d0642e Cast value to float, as that's what the console templates expect. 2015-08-24 16:59:08 +01:00
Fabian Reinartz 438e232c9b Fix grouping of import blocks 2015-08-22 09:42:45 +02:00
Fabian Reinartz 306e8468a0 Switch from client_golang/model to common/model 2015-08-21 13:33:38 +02:00
Fabian Reinartz 7a67472fc1 Resolve relative paths on configuration loading
This moves the concern of resolving the files relative to the config
file into the configuration loading itself.
It also fixes #921 which did not load the cert and token files relatively.
2015-08-05 18:08:04 +02:00
Fabian Reinartz feb8a03503 rules: load rule files relative to a base dir 2015-07-03 15:10:37 +02:00
Julius Volz fcff35b43e Consolidate external reachability flags into one.
Besides fixing https://github.com/prometheus/prometheus/issues/805 by
making the entire externally reachable server URL configurable, this
adds tests for the "globalURL" template function and makes it easier to
test other such functions in the future.

This breaks the `web.Hostname` flag (and introduces `web.external-url`).
This flag is likely only used by few users, so I hope that's
justifiable.

Fixes https://github.com/prometheus/prometheus/issues/805
2015-07-03 13:39:10 +02:00
Fabian Reinartz f06cf664e1 rules: cleanup alerting test 2015-06-30 18:22:24 +02:00
Fabian Reinartz 9bd4f6d017 rules: preserve alert state across reloads. 2015-06-30 11:32:07 +02:00
Fabian Reinartz 4625485b84 rules: move rules*.go contents to manager*.go 2015-06-30 11:32:07 +02:00
Fabian Reinartz 749ae450c5 promql: add runbook to alert statement.
This commit adds the RUNBOOK keyword to alert statements. The field
is optional and expected to be a link.
2015-06-25 13:00:52 +02:00
Fabian Reinartz 5e13880201 General cleanup of rules. 2015-06-06 21:40:52 +02:00
Fabian Reinartz 280d11dca8 main: exit on invalid rule files on startup. 2015-06-02 18:44:41 +02:00
Fabian Reinartz 0de6edbdfc Move pkg/ to util/ 2015-06-01 21:12:32 +02:00
Fabian Reinartz dbc0d30e3e Move string functionality to pkg/strutil 2015-06-01 21:12:32 +02:00
Fabian Reinartz f45a5cab60 Move templates package to pkg/template 2015-06-01 21:12:31 +02:00
Fabian Reinartz c44ac7bc26 Load rule files from entire directories 2015-06-01 21:12:31 +02:00
Julius Volz ff53d10849 Fix double slash in GeneratorURL sent to alertmanager.
Fixes https://github.com/prometheus/prometheus/issues/722
2015-05-23 19:16:57 +02:00
Julius Volz 267fd34156 Switch Prometheus to use github.com/prometheus/log.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
2015-05-20 18:19:32 +02:00
Fabian Reinartz bb540fd9fd Implement config reloading on SIGHUP.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
2015-05-13 16:49:46 +02:00
Fabian Reinartz fe935179cd Stop routing rule statements through the engine. 2015-04-29 18:01:43 +02:00
Fabian Reinartz 479891c9be Rename RuleManager to Manager, remove interface.
This commits renames the RuleManager to Manager as the package
name is 'rules' now. The unused layer of abstraction of the
RuleManager interface is removed.
2015-04-29 16:42:10 +02:00
Fabian Reinartz 3ca11bcaf5 Switch Prometheus to promql package.
This commit removes all functionality from rules/ that is now handled in
promql/.
All parts of Prometheus are changed to use the promql/ package.
2015-04-28 16:19:23 +02:00
Brian Brazil e041c0cd46 Add console and alert templates with access to all data.
Move rulemanager to it's own package to break cicrular dependency.
Make NewTestTieredStorage available to tests, remove duplication.

Change-Id: I33b321245a44aa727bfc3614a7c9ae5005b34e03
2014-05-30 16:24:56 +01:00
Julius Volz 01f652cb4c Separate storage implementation from interfaces.
This was initially motivated by wanting to distribute the rule checker
tool under `tools/rule_checker`. However, this was not possible without
also distributing the LevelDB dynamic libraries because the tool
transitively depended on Levigo:

rule checker -> query layer -> tiered storage layer -> leveldb

This change separates external storage interfaces from the
implementation (tiered storage, leveldb storage, memory storage) by
putting them into separate packages:

- storage/metric: public, implementation-agnostic interfaces
- storage/metric/tiered: tiered storage implementation, including memory
                         and LevelDB storage.

I initially also considered splitting up the implementation into
separate packages for tiered storage, memory storage, and LevelDB
storage, but these are currently so intertwined that it would be another
major project in itself.

The query layers and most other parts of Prometheus now have notion of
the storage implementation anymore and just use whatever implementation
they get passed in via interfaces.

The rule_checker is now a static binary :)

Change-Id: I793bbf631a8648ca31790e7e772ecf9c2b92f7a0
2014-04-16 13:30:19 +02:00
Julius Volz 20bfaf80ab Merge "Display filename when encountering bad rule file." 2013-12-13 15:01:02 +01:00
Julius Volz 3bf3a555b2 Merge "add evalDuration histogram and ruleCount counter for rules" 2013-12-11 22:52:19 +01:00
Stuart Nelson b75adfebad add evalDuration histogram and ruleCount counter for rules
Change-Id: I3508fe72526348d96b8158828388c3ac8d7c3fa9
2013-12-11 15:42:53 -05:00
Julius Volz 77a79d1fc0 Display filename when encountering bad rule file.
Change-Id: I4729371be92c5659a6938145c5fde66771d7be22
2013-12-11 15:44:11 +01:00
Julius Volz fb44580110 Cleanup/fix program termination sequence.
Change-Id: I2bc58a2583fb079c9ef383cfc7a5e0fbe613f1cd
2013-12-11 15:40:32 +01:00
Julius Volz 740d448983 Use custom timestamp type for sample timestamps and related code.
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:

- there could be time.Time values which were out of the range/precision of the
  time type that we persist to disk, therefore causing incorrectly ordered keys.
  One bug caused by this was:

  https://github.com/prometheus/prometheus/issues/367

  It would be good to use a timestamp type that's more closely aligned with
  what the underlying storage supports.

- sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit
  Unix timestamp (possibly even a 32-bit one). Since we store samples in large
  numbers, this seriously affects memory usage. Furthermore, copying/working
  with the data will be faster if it's smaller.

*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.

*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.

For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
  passed into Target.Scrape()).

When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
  telemetry exporting, timers that indicate how frequently to execute some
  action, etc.

*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.

Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
2013-12-03 09:11:28 +01:00
Julius Volz 1eb1ceac8c Add alert-expression console links to notifications.
The ConsoleLinkForExpression() function now escapes console URLs in such a way
that works both in emails and in HTML.

Change-Id: I917bae0b526cbbac28ccd2a4ec3c5ac03ee4c647
2013-08-20 15:45:41 +02:00
Julius Volz aa5d251f8d Use github.com/golang/glog for all logging. 2013-08-12 17:54:36 +02:00
Julius Volz 3b970c5133 Add variable interpolation to notification messages.
This includes required refactorings to enable replacing the http client (for
testing) and moving the NotificationReq type definitions to the "notifications"
package, so that this package doesn't need to depend on "rules" anymore and
that it can instead use a representation of the required data which only
includes the necessary fields.
2013-08-12 12:29:08 +02:00
Julius Volz 35ee2cd3cb Add alertmanager notification support to Prometheus.
Alert definitions now also have mandatory SUMMARY and DESCRIPTION fields
that get sent along a firing alert to the alert manager.
2013-07-30 17:23:41 +02:00
Matt T. Proud 30b1cf80b5 WIP - Snapshot of Moving to Client Model. 2013-06-25 15:52:42 +02:00
Julius Volz 0226d1ac7a Implement alerts dashboard and expression console links. 2013-06-13 22:35:40 +02:00
Julius Volz ba29d07901 Show loaded rules in Status dashboard. 2013-06-11 11:39:31 +02:00
Julius Volz adb87816f4 Put RuleManager concurrency in hands of caller, fix races. 2013-06-05 13:56:56 +02:00
Matt T. Proud c10780c966 Introduce telemetry for rule evaluator durations.
This commit adds telemetry for the Prometheus expression rule
evaluator, which will enable meta-Prometheus monitoring of customers
to ensure that no instance is falling behind in answering routine
queries.

A few other sundry simplifications are introduced, too.
2013-05-23 21:29:27 +02:00
Julius Volz 56324d8ce2 Make AST query storage non-global. 2013-05-07 13:15:10 +02:00
Julius Volz 9cea5d9df8 Convert the Prometheus configuration to protocol buffers. 2013-04-30 22:26:00 +02:00
Julius Volz d8110fcd9c Send sample arrays instead of single samples over channels. 2013-04-29 17:24:17 +02:00
Julius Volz 2202cd71c9 Track alerts over time and write out alert timeseries. 2013-04-26 14:35:21 +02:00
Julius Volz c0601abf46 Implement initial no-op alert parsing and rule parsing tests. 2013-04-23 13:48:24 +02:00
Julius Volz 1eb586db7d Fix rule evaluation closure. 2013-04-17 15:11:21 +02:00
Julius Volz c4d0969c00 Propagate more errors during rule evaluation. 2013-04-09 13:47:20 +02:00
Matt T. Proud ea54751431 Update import paths to new location.
This repository moved from matttproud/prometheus to
prometheus/prometheus, and all import paths need to be updated.
2013-01-27 18:49:45 +01:00
Julius Volz 483bd81a44 Allow grammar to parse both rules and single expressions. 2013-01-11 01:17:37 +01:00
Julius Volz 56384bf42a Add initial config and rule language implementation. 2013-01-07 23:43:36 +01:00