Commit graph

537 commits

Author SHA1 Message Date
Tobias Schmidt aaaba57184 Export number of missed rule evaluations
In case the execution of all rules takes longer than the configured rule
evaluation interval, one or more iterations will be skipped. This needs
to be visible to the opterator.
2017-04-02 20:03:28 -03:00
Fabian Reinartz 5772f1a7ba retrieval/storage: adapt to new interface
This simplifies the interface to two add methods for
appends with labels or faster reference numbers.
2017-02-02 13:05:46 +01:00
Fabian Reinartz ad9bc62e4c storage: extend appender and adapt it 2017-01-13 14:48:01 +01:00
Fabian Reinartz e94b0899ee rules: fix tests, remove model types 2016-12-29 17:31:14 +01:00
Fabian Reinartz f8fc1f5bb2 *: migrate ingestion to new batch Appender 2016-12-29 11:03:56 +01:00
Fabian Reinartz fecf9532b9 *: fix misc compile errors 2016-12-25 11:42:57 +01:00
Fabian Reinartz 622ece6273 *: fix recording tests, migrate matcher types 2016-12-25 11:12:57 +01:00
Fabian Reinartz 5817cb5bde *: migrate from model.* to promql.* types 2016-12-25 00:37:46 +01:00
Fabian Reinartz e68a3cf21f rules: update annotations on each iteration 2016-11-22 15:43:07 +01:00
Jonathan Lange d78dd3593d Set evaluation interval on Group construction
Prevents having object in invalid state, and allows users of public API
to construct valid Groups.
2016-11-18 16:32:30 +00:00
Jonathan Lange 31fc357cd8 Make NewGroup and Group.Eval public
Allows callers to execute evaluate lists of rules without first writing
them to disk.
2016-11-18 16:25:58 +00:00
Jonathan Lange 2a2da40223 Make rule evaluation publicly available
Means that a third-party can parse rules and run them with their own
execution model.
2016-11-18 16:12:50 +00:00
Matt Bostock 926a5ab3dd rules/manager.go: Fix race between reload and stop
On one relatively large Prometheus instance (1.7M series), I noticed
that upgrades were frequently resulting in Prometheus undergoing crash
recovery on start-up.

On closer examination, I found that Prometheus was panicking on
shutdown.

It seems that our configuration management (or misconfiguration thereof)
is reloading Prometheus then immediately restarting it, which I suspect
is causing this race:

    Sep 21 15:12:42 host systemd[1]: Reloading prometheus monitoring system.
    Sep 21 15:12:42 host prometheus[18734]: time="2016-09-21T15:12:42Z" level=info msg="Loading configuration file /etc/prometheus/config.yaml" source="main.go:221"
    Sep 21 15:12:42 host systemd[1]: Reloaded prometheus monitoring system.
    Sep 21 15:12:44 host systemd[1]: Stopping prometheus monitoring system...
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:203"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="See you next time!" source="main.go:210"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="Stopping target manager..." source="targetmanager.go:90"
    Sep 21 15:12:52 host prometheus[18734]: time="2016-09-21T15:12:52Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:548"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=1 source="scrape.go:467"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping rule manager..." source="manager.go:366"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Rule manager stopped." source="manager.go:372"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping notification handler..." source="notifier.go:325"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping local storage..." source="storage.go:381"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping maintenance loop..." source="storage.go:383"
    Sep 21 15:13:01 host prometheus[18734]: panic: close of closed channel
    Sep 21 15:13:01 host prometheus[18734]: goroutine 7686074 [running]:
    Sep 21 15:13:01 host prometheus[18734]: panic(0xba57a0, 0xc60c42b500)
    Sep 21 15:13:01 host prometheus[18734]: /usr/local/go/src/runtime/panic.go:500 +0x1a1
    Sep 21 15:13:01 host prometheus[18734]: github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1(0xc6645a9901, 0xc420271ef0, 0xc420338ed0, 0xc60c42b4f0, 0xc6645a9900)
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:412 +0x3c
    Sep 21 15:13:01 host prometheus[18734]: created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:423 +0x56b
    Sep 21 15:13:03 host systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
2016-09-21 22:03:02 +01:00
Julius Volz c187308366 storage: Contextify storage interfaces.
This is based on https://github.com/prometheus/prometheus/pull/1997.

This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
2016-09-19 16:29:07 +02:00
Julius Volz ed5a0f0abe promql: Allow per-query contexts.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.

This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.

Thus, we want to be able to pass per-query contexts into a single engine.

This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).

In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
2016-09-19 15:38:17 +02:00
beorn7 75bae065fd Revert "Modify tests to adjust to reverting the /graph changes"
This reverts commit f1ea5bf232.

Part two necessary for reverting the /graph revert.
2016-09-03 21:08:33 +02:00
beorn7 f1ea5bf232 Modify tests to adjust to reverting the /graph changes
These tests have been added after the /graph changes and therefore
already test the new syntax.

This commit has to be reverted together with the previous one to get
back to the old new state. *sigh*
2016-09-02 14:12:31 +02:00
Julius Volz fe7b8b7fd1 Add missing license header to alerting_test.go 2016-08-13 00:11:52 +02:00
Julius Volz da7206ec29 Fix rule HTML escaping issues
This was mentioned as part of https://github.com/prometheus/alertmanager/issues/452
2016-08-12 02:59:41 +02:00
Brian Brazil 6fc88d4b4d Remove __name__ from alerts sent to AM.
Fixes #1861
2016-08-01 23:32:41 +01:00
Dmitry Vorobev 273e457da4 web: return status code and error message for config resource 2016-07-15 10:15:24 +02:00
Brian Brazil 0509b0f2db Expand alert templates at eval time.
Fixes #1678 #1677
2016-07-12 17:13:55 +01:00
beorn7 064b57858e Consistently use the Seconds() method for conversion of durations
This also fixes one remaining case of recording integral numbers
of seconds only for a metric, i.e. this will probably fix #1796.
2016-07-07 15:24:35 +02:00
Fabian Reinartz f7ed2ff706 Merge pull request #1644 from prometheus/beorn7/logging
Add missing logging of out-of-order samples
2016-05-20 05:52:00 -07:00
beorn7 b95c096a45 Fix style issues in rules/... 2016-05-19 16:59:53 +02:00
beorn7 45e5775f9b Add missing logging of out-of-order samples
So far, out-of-order samples during rule evaluation were not logged,
and neither scrape health samples. The latter are unlikely to cause
any errors. That's why I'm logging them always now. (It's alway highly
irregular should it happen.) For rules, I have used the same plumbing
as for samples, just with a different wording in the message to mark
them as a result of rule evaluation.
2016-05-19 16:22:53 +02:00
beorn7 4b574e8a61 Switch chunk encoding to type 2 where it was hardcoded type 1 before
The chunk encoding was hardcoded there because it mostly doesn't
matter what encoding is chosen in that test. Since type 1 is
battle-hardened enough, I'm switching to type 2 here so that we can
catch unexpected problems as a byproduct. My expectation is that the
chunk encoding doesn't matter anyway, as said, but then "unexpected
problems" contains the word "unexpected".
2016-03-20 23:32:20 +01:00
Fabian Reinartz d89c254849 Make copying alerting state safer.
This considers static labels in the equality of alerts to
avoid falsely copying state from a different alert definition with
the same name across reloads.

To be safe, it also copies the state map rather than just its pointer
so that remaining collisions disappear after one evaluation interval.
2016-03-02 12:21:54 +01:00
Fabian Reinartz bfa8aaa017 Rename notification to notifier 2016-03-01 12:39:08 +01:00
beorn7 663a1550d0 Fix the instrumentation fixes 2016-02-17 15:50:55 +01:00
Tobias Schmidt f1f8317fa5 Fix detection of flapping alerts
Alerts in the resolve retention period must be transitioned to the
active state again when their condition is met.
2016-02-04 23:55:12 -05:00
Björn Rabenstein 9ea3897ea7 Merge pull request #1354 from prometheus/beorn7/storage
Rework the way to communicate backpressure (AKA suspended ingestion)
2016-02-01 15:10:13 +01:00
beorn7 ec08c9a391 Rework the way to communicate backpressure (AKA suspended ingestion)
This gives up on the idea to communicate throuh the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.

Logging of throttled mode has been streamlined and will create at most
one message per minute.
2016-02-01 14:45:44 +01:00
beorn7 a7408bfb47 Unify duration parsing
It's actually happening in several places (and for flags, we use the
standard Go time.Duration...). This at least reduces all our
home-grown parsing to one place (in model).
2016-01-29 15:41:50 +01:00
Fabian Reinartz a6935024e1 Remove old WITH clause in alert printing 2016-01-26 15:45:27 +01:00
Fabian Reinartz b0adfea8d5 Fix swapped constants, improve instrumentation 2016-01-21 12:15:29 +01:00
Fabian Reinartz a8c38c3ac5 Don't log rule evaluation failure on shutdown 2016-01-18 17:34:25 +01:00
Fabian Reinartz 6eee86dce8 Terminate rule groups during initial sleep
When an evaluation group runs initially, it waits a deterministic
amount of time. During that time it also has to accept
a termination singnal so shutdown doesn't hang during the first
evaluation iteration after a configuration reload.

Fixes #1307
2016-01-12 10:54:09 +01:00
Fabian Reinartz 26eb3ac2f8 Don't skip recording rule errors 2016-01-12 10:26:06 +01:00
Fabian Reinartz 37d80c4b25 Fix premature rule evaluation
This commit prevents rule evaluation from starting until after
the storage is ready.
2016-01-08 17:51:22 +01:00
Fabian Reinartz 0cf3c6a9ef Add comments, rename a method 2015-12-23 12:29:28 +01:00
Fabian Reinartz bf6abac8f4 Send resolved notifications 2015-12-17 15:42:26 +01:00
Fabian Reinartz f69e668fc4 Improve rules/ instrumentation
This commit adds a counter for the total number of rule evaluations
and standardizes the units to seconds.
2015-12-17 15:42:26 +01:00
Fabian Reinartz 62075aa037 Reduce noisy no-alertmanager warning 2015-12-17 15:42:26 +01:00
Fabian Reinartz 52e5224f5a Refactor rules/ package 2015-12-17 15:42:25 +01:00
Fabian Reinartz e4fabe135a Set StartsAt to time of first firing state 2015-12-17 11:36:58 +01:00
Fabian Reinartz 7c90db22ed Use annotation based alerts in rules/
This commit breaks the previously used alert format.
2015-12-14 10:16:07 +01:00
Fabian Reinartz e114ce0ff7 Refactor notification handler 2015-12-11 15:17:32 +01:00
Fabian Reinartz e3b6ec9784 Switch to common/log 2015-10-03 10:21:43 +02:00
Fabian Reinartz 171f50706a Fix unkeyed field errors. 2015-09-18 17:00:08 +02:00
Brian Brazil 4d196fea6b Merge pull request #1032 from prometheus/scalar-metric
rules: Allow for setting labels on LHS on scalars
2015-08-26 16:56:16 +01:00
Brian Brazil 3bcdb2bbba rules: Allow for setting labels on LHS on scalars 2015-08-26 16:54:28 +01:00
Julius Volz 995d3b831d Fix most golint warnings.
This is with `golint -min_confidence=0.5`.

I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
2015-08-26 12:44:46 +02:00
Fabian Reinartz d6b8da8d43 Switch promql types to common/model 2015-08-25 13:49:14 +02:00
Brian Brazil fdf0d0642e Cast value to float, as that's what the console templates expect. 2015-08-24 16:59:08 +01:00
Fabian Reinartz 438e232c9b Fix grouping of import blocks 2015-08-22 09:42:45 +02:00
Fabian Reinartz 306e8468a0 Switch from client_golang/model to common/model 2015-08-21 13:33:38 +02:00
Brian Brazil e6a67476c2 rules: Allow recorded rules expressions to be scalars.
This is useful if you want to build up a constant metric,
such as a set of alert thresholds that vary by label value.
2015-08-19 21:09:00 +01:00
Fabian Reinartz 7a67472fc1 Resolve relative paths on configuration loading
This moves the concern of resolving the files relative to the config
file into the configuration loading itself.
It also fixes #921 which did not load the cert and token files relatively.
2015-08-05 18:08:04 +02:00
Fabian Reinartz feb8a03503 rules: load rule files relative to a base dir 2015-07-03 15:10:37 +02:00
Julius Volz fcff35b43e Consolidate external reachability flags into one.
Besides fixing https://github.com/prometheus/prometheus/issues/805 by
making the entire externally reachable server URL configurable, this
adds tests for the "globalURL" template function and makes it easier to
test other such functions in the future.

This breaks the `web.Hostname` flag (and introduces `web.external-url`).
This flag is likely only used by few users, so I hope that's
justifiable.

Fixes https://github.com/prometheus/prometheus/issues/805
2015-07-03 13:39:10 +02:00
Fabian Reinartz f06cf664e1 rules: cleanup alerting test 2015-06-30 18:22:24 +02:00
Fabian Reinartz 9bd4f6d017 rules: preserve alert state across reloads. 2015-06-30 11:32:07 +02:00
Fabian Reinartz 4625485b84 rules: move rules*.go contents to manager*.go 2015-06-30 11:32:07 +02:00
Fabian Reinartz 749ae450c5 promql: add runbook to alert statement.
This commit adds the RUNBOOK keyword to alert statements. The field
is optional and expected to be a link.
2015-06-25 13:00:52 +02:00
Julius Volz d868264bb8 Improve UI of /alerts page.
Changes to the UI:
- "Active Since" timestamps are now human-readable.
- Alerting rules are now pretty-printed better.
- Labels are no longer just strings, but alert bubbles (like we do on
  the status page for base labels).
- Alert states and target health states are now capitalized in the
  presentation layer rather than at the source.
2015-06-23 18:48:45 +02:00
Fabian Reinartz fe301d7946 promql: remove global flags 2015-06-15 19:01:06 +02:00
Fabian Reinartz 5e13880201 General cleanup of rules. 2015-06-06 21:40:52 +02:00
Fabian Reinartz 75c920c95e Remove DotGraph method from Rule interface 2015-06-06 21:35:59 +02:00
Fabian Reinartz 83d07516e8 Remove EvalRaw methods from Rule interface 2015-06-06 21:34:09 +02:00
Fabian Reinartz 280d11dca8 main: exit on invalid rule files on startup. 2015-06-02 18:44:41 +02:00
Fabian Reinartz 0de6edbdfc Move pkg/ to util/ 2015-06-01 21:12:32 +02:00
Fabian Reinartz 02717e6fde Remove generic set type 2015-06-01 21:12:32 +02:00
Fabian Reinartz dbc0d30e3e Move string functionality to pkg/strutil 2015-06-01 21:12:32 +02:00
Fabian Reinartz f45a5cab60 Move templates package to pkg/template 2015-06-01 21:12:31 +02:00
Fabian Reinartz c44ac7bc26 Load rule files from entire directories 2015-06-01 21:12:31 +02:00
Julius Volz d7c015c149 Convert pathPrefix to not have trailing slash. 2015-06-01 12:43:17 +02:00
Julius Volz ff53d10849 Fix double slash in GeneratorURL sent to alertmanager.
Fixes https://github.com/prometheus/prometheus/issues/722
2015-05-23 19:16:57 +02:00
Julius Volz 267fd34156 Switch Prometheus to use github.com/prometheus/log.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
2015-05-20 18:19:32 +02:00
Fabian Reinartz e2ed921505 Merge branch 'master' into fabxc/servdisc 2015-05-20 14:13:08 +02:00
Mitsuhiro Tanda 3e914a8cb1 fix graph links with path prefix 2015-05-19 02:45:05 +09:00
Fabian Reinartz bb540fd9fd Implement config reloading on SIGHUP.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
2015-05-13 16:49:46 +02:00
Fabian Reinartz fe935179cd Stop routing rule statements through the engine. 2015-04-29 18:01:43 +02:00
Fabian Reinartz 8d7c479fed Merge pull request #658 from prometheus/fabxc/pql/rules-manager
Rename RuleManager to Manager, remove interface.
2015-04-29 16:54:21 +02:00
Fabian Reinartz 479891c9be Rename RuleManager to Manager, remove interface.
This commits renames the RuleManager to Manager as the package
name is 'rules' now. The unused layer of abstraction of the
RuleManager interface is removed.
2015-04-29 16:42:10 +02:00
Fabian Reinartz 25cdff3527 Remove name arg from Parse* functions, enhance parsing errors. 2015-04-29 16:38:41 +02:00
Fabian Reinartz 3ca11bcaf5 Switch Prometheus to promql package.
This commit removes all functionality from rules/ that is now handled in
promql/.
All parts of Prometheus are changed to use the promql/ package.
2015-04-28 16:19:23 +02:00
Ceesjan Luiten 0e18784c64 Make all paths absolute to support proxies 2015-04-02 20:36:47 +02:00
Brian Brazil 941f585164 Avoid +InfYs and similar, just display +Inf. 2015-03-28 18:51:41 +00:00
beorn7 a075900f9a Merge branch 'beorn7/persistence' into beorn7/ingestion-tweaks 2015-03-18 19:09:31 +01:00
Fabian Reinartz 624f27f4b6 Add ln, log2, log10 and exp functions to the query language. 2015-03-16 18:26:19 +01:00
Julius Volz b2651027fc Fix special value handling in division and modulo.
This fixes https://github.com/prometheus/prometheus/issues/597
2015-03-16 14:23:40 +01:00
beorn7 be11cb2b07 Remove the sample ingestion channel.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.

Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...

Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.

The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.

Also, remove lint and vet warnings in ast.go.
2015-03-15 14:08:22 +01:00
beorn7 13fcf1ddbc Implement double-delta encoded chunks. 2015-03-05 20:33:26 +01:00
beorn7 9e85ab0eef Apply the new signature/fingerprinting functions from client_golang.
This requires the new version of client_golang (vendoring will follow
in the next commit), which changes the fingerprinting for
clientmodel.Metric.
2015-03-03 18:34:01 +01:00
Fabian Reinartz 182de6b99f Fix unary +/- expressions.
Unary expressions cause parsing errors if they are done in the lexer
by tokenizing them into the number.
This fix moves unary expressions to the parser.
2015-03-03 13:30:08 +01:00
Fabian Reinartz 6f754073d5 Add OR operation and vector matching options.
This commits implements the OR operation between two vectors.
Vector matching using the ON clause is added to limit the set of
labels that define a match between two elements. Group modifiers
(GROUP_LEFT/GROUP_RIGHT) to request many-to-one matching are added.
2015-03-03 11:35:10 +01:00
Julius Volz 0ac931aed1 Also support parsing float formats like "2.". 2015-03-02 12:58:05 +01:00
Julius Volz c2ab54e9a6 Support scientific notation and special float values.
This adds support for scientific notation in the expression language, as
well as for all possible literal forms of +Inf/-Inf/NaN.

TODO: Keep enough state in the parser/lexer to distinguish contexts in
which "Inf", "NaN", etc. should be parsed as a number vs. parsed as a
label name. Currently, foo{nan="bar"} would be a syntax error. However,
that is an existing bug for all our reserved words. E.g. foo{sum="bar"}
is a syntax error as well. This should be fixed separately.
2015-03-01 19:31:16 +01:00
beorn7 1a61bcae07 Fix plural of 'histogram'.
Actually, 'histogram' is Ancient Greek and 3rd declension... ;-)
2015-02-23 15:29:26 +01:00
beorn7 17443d288b Avoid copying of the COWMetric if we already have the metric available. 2015-02-22 01:04:52 +01:00
beorn7 9e7c3e3bcd Add the histogram_quantile function.
Since we are now getting really deep into floating point calculation,
the tests had to take into account the precision loss. Since the rule
tests are based on direct line matching in the output, implementing
the "almost equal" semantics was pretty cumbersome, but here we are.
2015-02-22 01:04:51 +01:00
Julius Volz 42601acfde Replace labelsToKey() with metric Fingerprint (fixes grouping bug). 2015-02-21 17:45:47 +01:00
Julius Volz 7fefccd929 Write() directly into hash and use model.SeparatorByte. 2015-02-21 17:19:13 +01:00
Julius Volz 645cf57bed Fix aggregation grouping key calculation. 2015-02-21 14:05:50 +01:00
Julius Volz 15b2b5aa66 Add tests for invalid uses of "offset". 2015-02-18 02:56:40 +01:00
Julius Volz 67e20acc6c Lower-case some package-internal names. 2015-02-18 02:45:54 +01:00
Julius Volz 72d7b325a1 Implement offset operator.
This allows changing the time offset for individual instant and range
vectors in a query.

For example, this returns the value of `foo` 5 minutes in the past
relative to the current query evaluation time:

    foo offset 5m

Note that the `offset` modifier always needs to follow the selector
immediately. I.e. the following would be correct:

    sum(foo offset 5m) // GOOD.

While the following would be *incorrect*:

    sum(foo) offset 5m // INVALID.

The same works for range vectors. This returns the 5-minutes-rate that
`foo` had a week ago:

    rate(foo[5m] offset 1w)

This change touches the following components:

* Lexer/parser: additions to correctly parse the new `offset`/`OFFSET`
  keyword.
* AST: vector and matrix nodes now have an additional `offset` field.
  This is used during their evaluation to adjust query and result times
  appropriately.
* Query analyzer: now works on separate sets of ranges and instants per
  offset. Isolating different offsets from each other completely in this
  way keeps the preloading code relatively simple.

No storage engine changes were needed by this change.

The rules tests have been changed to not probe the internal
implementation details of the query analyzer anymore (how many instants
and ranges have been preloaded). This would also become too cumbersome
to test with the new model, and measuring the result of the query should
be sufficient.

This fixes https://github.com/prometheus/prometheus/issues/529
This fixed https://github.com/prometheus/promdash/issues/201
2015-02-18 02:41:27 +01:00
Brian Brazil 60271d58bf Change the 2nd argument of round to toNearest.
This is more useful if you want get a multiple of 2 or 5, while
still working for .001.
2015-02-05 16:13:40 +00:00
Julius Volz 82613527f3 Remove unnecessary float64() conversion in round(). 2015-02-05 15:14:05 +01:00
Marko Mikulicic 8fdacbdf17 Add floor, ceil and round functions. Closes #402 2015-02-04 17:20:56 +01:00
Fabian Reinartz fa1e90003b Query timeout added.
This is related to #454. Queries now timeout after a duration set by
the -query.timeout flag. The TotalEvalTimer is now started/stopped
inside any of the ast.Eval* functions.
2015-02-03 08:04:27 +01:00
Bjoern Rabenstein 26e22e6ad6 Fix rule manager shutdown. 2015-01-29 15:05:10 +01:00
Julius Volz d4374a9265 More efficient JSON query result format.
This depends on https://github.com/prometheus/client_golang/pull/51.

For vectors, the result format looks like this:

```json
{
   "version": 1,
   "type" : "vector",
   "value" : [
      {
         "timestamp" : 1421765411.045,
         "value" : "65.475000",
         "metric" : {
            "quantile" : "0.5",
            "instance" : "http://localhost:9090/metrics",
            "job" : "prometheus",
            "__name__" : "http_request_duration_microseconds",
            "handler" : "/static/",
            "method" : "get",
            "code" : "304"
         }
      },
      {
         "timestamp" : 1421765411.045,
         "value" : "5826.339000",
         "metric" : {
            "quantile" : "0.9",
            "instance" : "http://localhost:9090/metrics",
            "job" : "prometheus",
            "__name__" : "http_request_duration_microseconds",
            "handler" : "prometheus",
            "method" : "get",
            "code" : "200"
         }
      },
      /* ... */
   ]
}
```

For matrices, it looks like this:

```json
{
   "version": 1,
   "type" : "matrix",
   "value" : [
      {
         "metric" : {
            "quantile" : "0.99",
            "instance" : "http://localhost:9090/metrics",
            "job" : "prometheus",
            "__name__" : "http_request_duration_microseconds",
            "handler" : "/static/",
            "method" : "get",
            "code" : "200"
         },
         "values" : [
            [
               1421765547.659,
               "29162.953000"
            ],
            [
               1421765548.659,
               "29162.953000"
            ],
            [
               1421765549.659,
               "29162.953000"
            ],
            /* ... */
         ]
      }
   ]
}
```
2015-01-26 13:06:22 +01:00
Brian Brazil a31730e88b Make 2nd arg to delta optional. Add a deriv() function.
The 2nd isCounter argument to delta is ugly, make it optional as the first step
of deprecating it. This will makes delta only ever applied to gauges.

Add a deriv function to calculate the least squares
slope of a gauge. This is more useful for prediction than delta,
as it isn't as heavily influenced by outliers at the boundaries.
2015-01-23 14:50:27 +00:00
Bjoern Rabenstein 5859b74f1b Clean up license issues.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
2015-01-21 20:07:45 +01:00
Bjoern Rabenstein b09453af1d Adjust to new client_golang API. 2015-01-21 15:42:25 +01:00
Julius Volz bb1e49383e Log rule evalation errors. 2015-01-08 17:50:55 +01:00
Julius Volz d6b9e97655 Remove extraction.Result type, simplify code. 2015-01-08 16:34:01 +01:00
Julius Volz 9a4ca68a61 Add metrics for rule evaluation failures.
Fixes https://github.com/prometheus/prometheus/issues/417
2015-01-08 16:33:35 +01:00
Brian Brazil ffa2e73803 Fix regression from 5e8d57bec1
0 is a false value, so shortcutting no longer works.
Update other places in the code that assumed graph was the default.
2014-12-27 00:28:36 +00:00
Julius Volz cc27fb8aab Rename remaining all-caps constants in AST layer.
Change-Id: Ibe97e30981969056ffcdb89e63c1468ea1ffa140
2014-12-25 01:30:47 +01:00
Julius Volz 895523ad14 Include necessary Makefile.INCLUDE from rules/Makefile.
Change-Id: I077d018dbe4093cd40ddf38d66a996df222bf5e4
2014-12-25 01:13:59 +01:00
Julius Volz 2ade9d40cf Clarify why we need int constants for expression types.
Change-Id: I053fc5d32c118dbdb204dc8193337f981aff796e
2014-12-25 00:45:30 +01:00
Julius Volz 00a2a93a05 Add regression tests for metrics mutations in AST.
It turned out in the end, that only drop_common_metrics() produced any
erroneous output in the old system. The second expression in the test
("sum(testmetric) keeping_extra") already worked in the old code, but
why not keep it in...

The way to test ranged evaluations is a bit clumsy so far, so I want to
build a nicer test framework in the end, where all the test cases can be
specified as text files which specify desired inputs, outputs, query
step widths, etc.

Change-Id: I821859789e69b8232bededf670a1b76e9e8c8ca4
2014-12-12 20:34:55 +01:00
Julius Volz c9618d11e8 Introduce copy-on-write for metrics in AST.
This depends on changes in:

https://github.com/prometheus/client_golang/tree/cow-metrics.

Change-Id: I80b94833a60ddf954c7cd92fd2cfbebd8dd46142
2014-12-12 20:34:55 +01:00
Bjoern Rabenstein b1e4956142 Apply a giant code cleanup.
Essentially:

- Remove unused code.

- Make it 'go vet' clean. The only remaining warnings are in generated code.

- Make it 'golint' clean. The only remaining warnings are in gerenated code.

- Smoothed out same minor things.

Change-Id: I3fe5c1fbead27b0e7a9c247fee2f5a45bc2d42c6
2014-12-10 16:16:49 +01:00
Bjoern Rabenstein fee88a7a77 Remove the remaining races, new and old.
Also, resolve a few other TODOs.

Change-Id: Icb39b5a5e8ca22ebcb48771cd8951c5d9e112691
2014-12-03 18:07:23 +01:00
Bjoern Rabenstein 7d11019aa2 Squash a few trivial TODOs.
- Delete unneeded file view_adapter.go.
- Assessed that we still need the fingerprints in nodes
  (to create iterators).
- Turned numMemChunkDescs into a metric.

Change-Id: I29be963c795a075ec00c095f76bf26405535609d
2014-11-27 18:26:06 +01:00
Julius Volz 6eecee55b7 Fix acronym caps in GeneratorURL.
Change-Id: Ib18c1f617dcde1039e848059545a6d8831d9bf66
2014-11-25 17:13:04 +01:00
Bjoern Rabenstein 0ae1d8889a Fix tests after merge.
Change-Id: Ia90da9a3e48ed780ec38c4a6a1fd9ea34e7f6a58
2014-11-25 17:13:04 +01:00
Julius Volz b7bf11230a Add absent() function.
A common problem in Prometheus alerting is to detect when no timeseries
exist for a given metric name and label combination. Unfortunately,
Prometheus alert expressions need to be of vector type, and
"count(nonexistent_metric)" results in an empty vector, yielding no
output vector elements to base an alert on. The newly introduced
absent() function solves this issue:

  ALERT FooAbsent IF absent(foo{job="myjob"}) [...]

absent() has the following behavior:

- if the vector passed to it has any elements, it returns an empty
  vector.

- if the vector passed to it has no elements, it returns a 1-element
  vector with the value 1.

In the second case, absent() tries to be smart about deriving labels of
the 1-element output vector from the input vector:

  absent(nonexistent{job="myjob"})                   => {job="myjob"}
  absent(nonexistent{job="myjob",instance=~".*"})    => {job="myjob"}
  absent(sum(nonexistent{job="myjob"}))              => {}

That is, if the passed vector is a literal vector selector, it takes all
"=" label matchers as the basis for the output labels, but ignores all
non-equals or regex matchers. Also, if the passed vector results from a
non-selector expression, no labels can be derived.

Change-Id: I948505a1488d50265ab5692a3286bd7c8c70cd78
2014-11-25 17:13:04 +01:00
Julius Volz 3d47f94149 Drop metric names after transformations.
After many transformations, it doesn't make sense to keep the metric
names, since the result of the transformation is no longer that metric.
This drops the metric name after such transformations and makes the web
UI deal well with missing metric names.

This depends on the current branch on the following things:

- prometheus/client_golang needs to be at
  e237cf15c6
  in branch "julius/int-fingerprints" (to be merged with new storage)

- prometheus/promdash needs to be at
  dd7691c9c2

Change-Id: Ib3c8cad8d647d9854e8c653c424b8c235ccc231d
2014-11-25 17:13:04 +01:00
Bjoern Rabenstein 14bda4180c Changes after pair code review.
Change-Id: Ib72d40f8e9027818cfbbd32a7a7201eebda07455
2014-11-25 17:12:59 +01:00
Bjoern Rabenstein 006b5517e2 Simplify makefiles.
This removes the dependancy on C leveldb and snappy.
It also takes care of fewer dependencies as they would
anyway not work on any non-Debian, non-Brew system.

Change-Id: Ia70dce1ba8a816a003587927e0b3a3f8ad2fd28c
2014-11-25 17:10:39 +01:00
Bjoern Rabenstein 74c143c4c9 Improve scraper shutdown time.
- Stop target pools in parallel.
- Stop individual scrapers in goroutines, too.
- Timing tweaks.

Change-Id: I9dff1ee18616694f14b04408eaf1625d0f989696
2014-11-25 17:10:39 +01:00
Julius Volz 0712d738d1 Allow alternative "by"-clause position in grammar.
In addition to the existing by-clause syntax:

  sum(<expression>) by (<labels>) [keeping_extra]

...this allows the following new syntax:

  sum by (<labels>) [keeping_extra] (<expression>)

Both orderings may be used in a single expression. It is up to the users
to establish guidelines around their usage.

Change-Id: Iba10c9cc5fb6ac62edfcf246d281473e82467992
2014-11-25 17:09:04 +01:00
Julius Volz 0e48c18bbf Allow omitting the metric name in queries.
This allows the following expression syntaxes for selecting timeseries:

  foo                    (already valid before)
  foo{}                  (already valid before)
  {job="prometheus"}     (new, select all timeseries for job "prometheus")

Omitting both the metric name *and* any label matchers ("" or "{}") will
still yield a syntax error.

To get all timeseries, you could do:

    {__name__=~".*"}
       or, without relying on knowledge about __metric__:
    {job=~".*"}

Change-Id: Ifee000b9ac0184ef6ced18411069c7f2699a2dda
2014-11-25 17:09:04 +01:00
Bjoern Rabenstein 096fa0f8b2 Squash a number of TODOs.
- Staleness delta is no a proper function parameter and not replicated
  from package ast.

- Named type 'chunks' replaced by explicit '[]chunk' to avoid confusion.

- For the same reason, replaced 'chunkDescs' by '[]*chunkDescs'.

- Verified that math.Modf is not a speed enhancement over conversion
  (actually 5x slower).

- Renamed firstTimeField, lastTimeField into chunkFirstTime and
  chunkLastTime.

- Verified unpin() is sufficiently goroutine-safe.

- Decided not to update archivedFingerprintToTimeRange upon series
  truncation and added a rationale why.

Change-Id: I863b8d785e5ad9f71eb63e229845eacf1bed8534
2014-11-25 17:09:04 +01:00
Bjoern Rabenstein b3ed9aa7a2 Clean up start-up and shut-down.
Change-Id: Idff4bbb0a15a9f879bfbb3da5b1025179cab5e2c
2014-11-25 17:08:45 +01:00
Bjoern Rabenstein 38fc24d0ed Fix targetpool_test.go and other tests.
Change-Id: I91a4dd1d39e01f174e1aaae653ce1ed7aecaa624
2014-11-25 17:08:26 +01:00
Julius Volz 7f5d3c2c29 Fix and improve the fp locker.
Benchmark:
$ go test -bench 'Fingerprint' -test.run 'Fingerprint' -test.cpu=1,2,4

OLD
BenchmarkFingerprintLockerParallel        500000              3618 ns/op
BenchmarkFingerprintLockerParallel-2      100000             12257 ns/op
BenchmarkFingerprintLockerParallel-4      500000             10164 ns/op
BenchmarkFingerprintLockerSerial        10000000               283 ns/op
BenchmarkFingerprintLockerSerial-2      10000000               284 ns/op
BenchmarkFingerprintLockerSerial-4      10000000               288 ns/op

NEW
BenchmarkFingerprintLockerParallel       1000000              1018 ns/op
BenchmarkFingerprintLockerParallel-2     1000000              1164 ns/op
BenchmarkFingerprintLockerParallel-4     2000000               910 ns/op
BenchmarkFingerprintLockerSerial        50000000                56.0 ns/op
BenchmarkFingerprintLockerSerial-2      50000000                47.9 ns/op
BenchmarkFingerprintLockerSerial-4      50000000                54.5 ns/op

Change-Id: I3c65a43822840e7e64c3c3cfe759e1de51272581
2014-11-25 17:07:45 +01:00
Julius Volz 358f97791d Minor cleanups.
Change-Id: Ia8685d8439a421fe2143d9ec7120d5bb5ab88d78
2014-11-25 17:07:44 +01:00
Bjoern Rabenstein f5f9f3514a Major code cleanup.
- Make it go-vet and golint clean.
- Add comments, TODOs, etc.

Change-Id: If1392d96f3d5b4cdde597b10c8dff1769fcfabe2
2014-11-25 17:02:53 +01:00
Julius Volz e7ed39c9a6 Initial experimental snapshot of next-gen storage.
Change-Id: Ifb8709960dbedd1d9f5efd88cdd359ee9fa9d26d
2014-11-25 17:02:00 +01:00
Julius Volz 85497e3f38 Add function to drop common labels in a vector.
This fixes https://github.com/prometheus/prometheus/issues/384.

Change-Id: I2973c4baeb8a4618ec3875fb11c6fcf5d111784b
2014-11-25 17:02:00 +01:00
Julius Volz 3fdb74e571 Add more topk() / bottomk() tests.
Test what happens if k > number of input elements.

Change-Id: Ie724b850939e297ebf085f0a5a3522e9cfcc6534
2014-11-25 17:02:00 +01:00
Julius Volz c582ae73c2 Implement topk() and bottomk() functions.
To achieve O(log n * k) runtime, this uses a heap to track the current
bottom-k or top-k elements while iterating over the full set of
available elements.

It would be possible to reuse more code between topk and bottomk, but I
decided for some more duplication for the sake of clarity.

This fixes https://github.com/prometheus/prometheus/issues/399

Change-Id: I7487ddaadbe7acb22ca2cf2283ba6e7915f2b336
2014-11-25 17:02:00 +01:00
Bjoern Rabenstein 1909686789 Make metrics exported by the Prometheus server itself more consistent.
- Always spell out the time unit (e.g. milliseconds instead of ms).

- Remove "_total" from the names of metrics that are not counters.

- Make use of the "Namespace" and "Subsystem" fields in the options.

- Removed the "capacity" facet from all metrics about channels/queues.
  These are all fixed via command line flags and will never change
  during the runtime of a process. Also, they should not be part of
  the same metric family. I have added separate metrics for the
  capacity of queues as convenience. (They will never change and are
  only set once.)

- I left "metric_disk_latency_microseconds" unchanged, although that
  metric measures the latency of the storage device, even if it is not
  a spinning disk. "SSD" is read by many as "solid state disk", so
  it's not too far off. (It should be "solid state drive", of course,
  but "metric_drive_latency_microseconds" is probably confusing.)

- Brian suggested to not mix "failure" and "success" outcome in the
  same metric family (distinguished by labels). For now, I left it as
  it is. We are touching some bigger issue here, especially as other
  parts in the Prometheus ecosystem are following the same
  principle. We still need to come to terms here and then change
  things consistently everywhere.

Change-Id: If799458b450d18f78500f05990301c12525197d3
2014-11-25 17:02:00 +01:00
Julius Volz 00b9489f1c Fix time() behavior.
time() should return the timestamp for which the query is executed, not
the actual current time.

Change-Id: I430a45cabad7785cd58f95b1028a71dff4c87710
2014-11-25 17:02:00 +01:00
Julius Volz c5984f1818 Add abs() and over-time aggregation functions.
This implements aggregation functions over time as request in
https://github.com/prometheus/prometheus/issues/383.

Change-Id: Ifd69b850de8cfdf6e7a6c0e042056fa4c672410e
2014-11-25 17:02:00 +01:00
Brian Brazil f525ca5d9e Let consoles get graph links from experssions.
Rename ConsoleLinkFromExpression, as we now have consoles.

Change-Id: I7ed2c9c83863adb390b51121dd9736845f7bcdfc
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein 8956faeccb Migrate to new client_golang.
This change will only be submitted when the new client_golang has been
moved to the new version.

Change-Id: Ifceb59333072a08286a8ac910709a8ba2e3a1581
2014-11-25 17:01:59 +01:00
Brian Brazil 960ede66dc Use html/template for console templates and add template libary support.
Add a function to bypass the new auto-escaping.
Add a function to workaround go's templates only allowing passing in one argument.

Change-Id: Id7aa3f95e7c227692dc22108388b1d9b1e2eec99
2014-11-25 17:01:59 +01:00
Brian Brazil e041c0cd46 Add console and alert templates with access to all data.
Move rulemanager to it's own package to break cicrular dependency.
Make NewTestTieredStorage available to tests, remove duplication.

Change-Id: I33b321245a44aa727bfc3614a7c9ae5005b34e03
2014-05-30 16:24:56 +01:00
Bjoern Rabenstein ca6a4fccef Weed out our homegrown test.Tester.
The Go stdlib has testing.TB now, which fulfills the exact same
purpose.

Change-Id: I0db9c73400e208ca376b932a02b7e3402234b87c
2014-05-21 19:27:24 +02:00
Julius Volz 6297a405f2 Do not indent API JSON responses.
In one example response, this reduced the uncompressed size by 25% and
the gzipped size by 11%.

Change-Id: Ie80d44253124b9f8601b8ef9fc978e92dacff523
2014-04-22 15:16:37 +02:00
Julius Volz 01f652cb4c Separate storage implementation from interfaces.
This was initially motivated by wanting to distribute the rule checker
tool under `tools/rule_checker`. However, this was not possible without
also distributing the LevelDB dynamic libraries because the tool
transitively depended on Levigo:

rule checker -> query layer -> tiered storage layer -> leveldb

This change separates external storage interfaces from the
implementation (tiered storage, leveldb storage, memory storage) by
putting them into separate packages:

- storage/metric: public, implementation-agnostic interfaces
- storage/metric/tiered: tiered storage implementation, including memory
                         and LevelDB storage.

I initially also considered splitting up the implementation into
separate packages for tiered storage, memory storage, and LevelDB
storage, but these are currently so intertwined that it would be another
major project in itself.

The query layers and most other parts of Prometheus now have notion of
the storage implementation anymore and just use whatever implementation
they get passed in via interfaces.

The rule_checker is now a static binary :)

Change-Id: I793bbf631a8648ca31790e7e772ecf9c2b92f7a0
2014-04-16 13:30:19 +02:00
Julius Volz d411a7d810 Allow reversing vector and scalar arguments in binops.
This allows putting a scalar as the first argument of a binary operator
in which the second argument is a vector:

  <scalar> <binop> <vector>

For example,

  1 / http_requests_total

...will output a vector in which every sample value is 1 divided by the
respective input vector element.

This even works for filter binary operators now:

  1 == http_requests_total

Returns a vector with all values set to 1 for every element in
http_requests_total whose initial value was 1.

Note: For filter binary operators, the resulting values are always taken
from the left-hand-side of the operation, no matter whether the scalar
or the vector argument is the left-hand-side. That is,

  1 != http_requests_total

...will set all result vector sample values to 1, although these are
exactly the sample elements that were != 1 in the input vector.

If you want to just filter elements without changing their sample
values, you still need to do:

  http_requests_total != 1

The new filter form is a bit exotic, and so probably won't be used
often. But it was easier to implement it than disallow it completely or
change its behavior.

Change-Id: Idd083f2bd3a1219ba1560cf4ace42f5b82e797a5
2014-04-08 17:16:18 +02:00
Julius Volz c7c0b33d0b Add regex-matching support for labels.
There are four label-matching ops for selecting timeseries now:

- Equal: =
- NotEqual: !=
- RegexMatch: =~
- RegexNoMatch: !~

Instead of looking up labels by a simple clientmodel.LabelSet (basically
an equals op for every key/value pair in the set), timeseries
fingerprint selection is now done via a list of metric.LabelMatchers.

Change-Id: I510a83f761198e80946146770ebb64e4abc3bb96
2014-04-01 14:24:53 +02:00
Bjoern Rabenstein 0a65b691cc Disallow ":" in identifiers, but still allow it in metric names.
Change-Id: Iace925ab1b71a360bd63357e87f68e727f7afbcb
2014-03-21 13:44:37 +01:00
Julius Volz 86fc13a52e Convert metric.Values to slice of values.
The initial impetus for this was that it made unmarshalling sample
values much faster.

Other relevant benchmark changes in ns/op:

Benchmark                                 old        new   speedup
==================================================================
BenchmarkMarshal                       179170     127996     1.4x
BenchmarkUnmarshal                     404984     132186     3.1x

BenchmarkMemoryGetValueAtTime           57801      50050     1.2x
BenchmarkMemoryGetBoundaryValues        64496      53194     1.2x
BenchmarkMemoryGetRangeValues           66585      54065     1.2x

BenchmarkStreamAdd                       45.0       75.3     0.6x
BenchmarkAppendSample1                   1157       1587     0.7x
BenchmarkAppendSample10                  4090       4284     0.95x
BenchmarkAppendSample100                45660      44066     1.0x
BenchmarkAppendSample1000              579084     582380     1.0x
BenchmarkMemoryAppendRepeatingValues 22796594   22005502     1.0x

Overall, this gives us good speedups in the areas where they matter
most: decoding values from disk and accessing the memory storage (which
is also used for views).

Some of the smaller append examples take minimally longer, but the cost
seems to get amortized over larger appends, so I'm not worried about
these. Also, we're currently not bottlenecked on the write path and have
plenty of other optimizations available in that area if it becomes
necessary.

Memory allocations during appends don't change measurably at all.

Change-Id: I7dc7394edea09506976765551f35b138518db9e8
2014-03-11 18:23:37 +01:00
Julius Volz bc6ee6611e Rename persistence_adapter.go -> view_adapter.go
Change-Id: Ib45081393b734531d2f85a02f46e87930aab3273
2014-02-22 22:43:11 +01:00
Julius Volz 3f226c9724 Rename {Scalar,Vector}Literal to {Scalar,Vector}Selector.
Change-Id: Ie92301f47f5f49f30b3a62c365e377108982b080
2014-02-22 22:33:42 +01:00
Bjoern Rabenstein 682cf6fc51 Simplify QueryAnalizer.Visit().
Change-Id: I628582a1903b7273e78921e22a475f1dae5ebaae
2014-02-14 15:15:57 +01:00
Bjoern Rabenstein fd63500ed3 Make rules/ast golint clean.
Mostly, that means adding compliant doc strings to exported items.

Also, remove 'go vet' warnings where possible. (Some are unfortunately
not to avoid, arguably bugs in 'go vet'.)

Change-Id: I2827b6dd317492864c1383c3de1ea9eac5a219bb
2014-02-14 15:01:39 +01:00
Björn Rabenstein 59febe771a Merge "Minor code cleanups." 2014-02-13 15:29:16 +01:00
Julius Volz c4adfc4f25 Minor code cleanups.
Change-Id: Ib3729cf38b107b7f2186ccf410a745e0472e3630
2014-02-13 15:24:43 +01:00
Julius Volz 7e9ecaac3a Add count_scalar() function.
Change-Id: I63f09dd0479d0a6b016f5f857dd39dcbda56c7f9
2014-01-30 13:07:26 +01:00
Julius Volz 0378c2ca1f Nonexistent labels in BY-clauses shouldn't propagate to result.
This fixes bug 2. of https://github.com/prometheus/prometheus/issues/374

Change-Id: Ia4a13153616bafce5bf10597966b071434422d09
2014-01-24 16:05:30 +01:00
Julius Volz 6dc36d0c3e Don't keep extra labels in aggregations by default.
MIN/MAX/SUM/AVG/COUNT aggregations will now by default drop all labels that are
not specifically part of a BY-clause, even if a label value is the same within
all timeseries of an aggregation group. The old behavior of keeping extra
labels may still be switched on by adding KEEPING_EXTRA to the end of an
aggregation statement:

  sum(http_requests) by (job, method) keeping_extra

I'm open to better syntax/naming suggestions.

Change-Id: I21d3fe7af9e98552ce3dffa3ce7c0a4ba4c0b4a4
2013-12-16 12:53:10 +01:00
Julius Volz 20bfaf80ab Merge "Display filename when encountering bad rule file." 2013-12-13 15:01:02 +01:00
Julius Volz 3bf3a555b2 Merge "add evalDuration histogram and ruleCount counter for rules" 2013-12-11 22:52:19 +01:00
Stuart Nelson b75adfebad add evalDuration histogram and ruleCount counter for rules
Change-Id: I3508fe72526348d96b8158828388c3ac8d7c3fa9
2013-12-11 15:42:53 -05:00
Julius Volz 77a79d1fc0 Display filename when encountering bad rule file.
Change-Id: I4729371be92c5659a6938145c5fde66771d7be22
2013-12-11 15:44:11 +01:00
Julius Volz fb44580110 Cleanup/fix program termination sequence.
Change-Id: I2bc58a2583fb079c9ef383cfc7a5e0fbe613f1cd
2013-12-11 15:40:32 +01:00
Julius Volz 740d448983 Use custom timestamp type for sample timestamps and related code.
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:

- there could be time.Time values which were out of the range/precision of the
  time type that we persist to disk, therefore causing incorrectly ordered keys.
  One bug caused by this was:

  https://github.com/prometheus/prometheus/issues/367

  It would be good to use a timestamp type that's more closely aligned with
  what the underlying storage supports.

- sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit
  Unix timestamp (possibly even a 32-bit one). Since we store samples in large
  numbers, this seriously affects memory usage. Furthermore, copying/working
  with the data will be faster if it's smaller.

*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.

*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.

For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
  passed into Target.Scrape()).

When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
  telemetry exporting, timers that indicate how frequently to execute some
  action, etc.

*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.

Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
2013-12-03 09:11:28 +01:00
Julius Volz c7daedc840 Merge "Add scalar() function." 2013-10-16 15:49:54 +02:00
Julius Volz be8024e18c Add scalar() function.
Change-Id: I1d1183e926a18fc98c9e94bbb9a808a3fb313102
2013-09-17 15:01:16 +02:00
Julius Volz 93a8d03221 Merge "Add alert-expression console links to notifications." 2013-08-24 19:40:50 +02:00
Julius Volz 1eb1ceac8c Add alert-expression console links to notifications.
The ConsoleLinkForExpression() function now escapes console URLs in such a way
that works both in emails and in HTML.

Change-Id: I917bae0b526cbbac28ccd2a4ec3c5ac03ee4c647
2013-08-20 15:45:41 +02:00
Matt T. Proud 7db518d3a0 Abstract high watermark cache into standard LRU.
Conflicts:
	storage/metric/memory.go
	storage/metric/tiered.go
	storage/metric/watermark.go

Change-Id: Iab2aedbd8f83dc4ce633421bd4a55990fa026b85
2013-08-19 12:26:55 +02:00
Julius Volz 0003027dce Add needed trailing spaces in logs. 2013-08-12 18:22:48 +02:00
Julius Volz aa5d251f8d Use github.com/golang/glog for all logging. 2013-08-12 17:54:36 +02:00
Julius Volz 3b970c5133 Add variable interpolation to notification messages.
This includes required refactorings to enable replacing the http client (for
testing) and moving the NotificationReq type definitions to the "notifications"
package, so that this package doesn't need to depend on "rules" anymore and
that it can instead use a representation of the required data which only
includes the necessary fields.
2013-08-12 12:29:08 +02:00
Julius Volz 35ee2cd3cb Add alertmanager notification support to Prometheus.
Alert definitions now also have mandatory SUMMARY and DESCRIPTION fields
that get sent along a firing alert to the alert manager.
2013-07-30 17:23:41 +02:00
Julius Volz 81f0b85013 Return [] instead of null for empty result vectors. 2013-07-25 12:16:32 +02:00
Julius Volz 64b0ade171 Swap rules lexer for much faster one.
This swaps github.com/kivikakk/golex for github.com/cznic/golex.

The old lexer would have taken 3.5 years to load a set of 5000 test rules
(quadratic time complexity for input length), whereas this one takes only 32ms.
Furthermore, since the new lexer is embedded differently, this gets rid of the
global parser variables and makes the rule loader fully reentrant without a
lock.
2013-07-11 19:35:29 +02:00
Julius Volz d2da21121c Implement getValueRangeAtIntervalOp for faster range queries.
This also short-circuits optimize() for now, since it is complex to implement
for the new operator, and ops generated by the query layer already fulfill the
needed invariants. We should still investigate later whether to completely
delete operator optimization code or extend it to support
getValueRangeAtIntervalOp operators.
2013-06-26 18:10:36 +02:00
Matt T. Proud 30b1cf80b5 WIP - Snapshot of Moving to Client Model. 2013-06-25 15:52:42 +02:00
Julius Volz 8ee7947b1e Ensure metric name is dropped correctly from alert labels in UI. 2013-06-14 13:03:19 +02:00
Julius Volz 0226d1ac7a Implement alerts dashboard and expression console links. 2013-06-13 22:35:40 +02:00
Julius Volz ba29d07901 Show loaded rules in Status dashboard. 2013-06-11 11:39:31 +02:00
Julius Volz fc97e688c6 Improve printing of rules and expressions. 2013-06-11 11:39:31 +02:00
Julius Volz 74cb676537 Implement Stringer interface for rules and all their children. 2013-06-07 15:54:32 +02:00
Matt T. Proud 2c3df44af6 Ensure database access waits until it is started.
This commit introduces a channel message to ensure serving
state has been reached with the storage stack before anything attempts
to use it.
2013-06-06 10:42:21 +02:00
Julius Volz 51689d965d Add debug timers to instant and range queries.
This adds timers around several query-relevant code blocks. For now, the
query timer stats are only logged for queries initiated through the UI.
In other cases (rule evaluations), the stats are simply thrown away.

My hope is that this helps us understand where queries spend time,
especially in cases where they sometimes hang for unusual amounts of
time.
2013-06-05 18:32:54 +02:00
Julius Volz adb87816f4 Put RuleManager concurrency in hands of caller, fix races. 2013-06-05 13:56:56 +02:00
Julius Volz 138334fb31 Fix handling of negative deltas for non-counter values. 2013-05-28 17:36:53 +02:00
Julius Volz 66d4620061 Don't assume delta has at least one sample per vector element. 2013-05-28 14:02:36 +02:00
Julius Volz 21c3be0814 Skip any empty range/boundary elements, not only nil ones. 2013-05-28 14:02:08 +02:00
Matt T. Proud c10780c966 Introduce telemetry for rule evaluator durations.
This commit adds telemetry for the Prometheus expression rule
evaluator, which will enable meta-Prometheus monitoring of customers
to ensure that no instance is falling behind in answering routine
queries.

A few other sundry simplifications are introduced, too.
2013-05-23 21:29:27 +02:00
Julius Volz 750f862d9a Use GetBoundaryValues() for non-counter deltas. 2013-05-22 19:13:47 +02:00
Julius Volz 5b105c77fc Repointerize fingerprints. 2013-05-21 14:28:14 +02:00
Matt T. Proud 8f4c7ece92 Destroy naked returns in half of corpus.
The use of naked return values is frowned upon.  This is the first
of two bulk updates to remove them.
2013-05-16 10:53:25 +03:00
juliusv 516101f015 Merge pull request #250 from prometheus/refactor/drop-unused-storage-setting
Drop unused writeMemoryInterval
2013-05-14 08:45:59 -07:00
juliusv 9ff00b651d Merge pull request #251 from prometheus/fix/memory-metric-mutability
Fix GetMetricForFingerprint() metric mutability.
2013-05-14 08:12:45 -07:00
Bernerd Schaefer 63d9988b9c Drop unused writeMemoryInterval 2013-05-14 17:03:03 +02:00
Bernerd Schaefer aa96c7d141 Fix rules_test.go
This is smelly, but for now we copy a helper method from the metric
tests into rules.
2013-05-14 16:55:18 +02:00
Julius Volz 83c60ad43a Fix GetMetricForFingerprint() metric mutability.
Some users of GetMetricForFingerprint() end up modifying the returned metric
labelset. Since the memory storage's implementation of
GetMetricForFingerprint() returned a pointer to the metric (and maps are
reference types anyways), the external mutation propagated back into the memory
storage.

The fix is to make a copy of the metric before returning it.
2013-05-14 16:46:30 +02:00
Bernerd Schaefer 428d91c86f Rename test helper files to helpers_test.go
This ensures that these files are properly included only in testing.
2013-05-14 16:30:47 +02:00
Matt T. Proud 244a4a9cdb Update to go1.1.
This commit updates the documentation, Makefiles, formatting, and
code semantics to support the 1.1. runtime, which includes ...

1. ``make advice``,

2. ``make format``, and

3. ``go fix`` on various targets.
2013-05-14 12:39:08 +02:00
Matt T. Proud 161c8fbf9b Include deletion processor for long-tail values.
This commit extracts the model.Values truncation behavior into the actual
tiered storage, which uses it and behaves in a peculiar way—notably the
retention of previous elements if the chunk were to ever go empty.  This is
done to enable interpolation between sparse sample values in the evaluation
cycle.  Nothing necessarily new here—just an extraction.

Now, the model.Values TruncateBefore functionality would do what a user
would expect without any surprises, which is required for the
DeletionProcessor, which may decide to split a large chunk in two if it
determines that the chunk contains the cut-off time.
2013-05-10 12:19:12 +02:00
Julius Volz 0877680761 Implement a COUNT ... BY aggregation operator.
This also removes the now obsolete scalar count() function and corrects the
expressions test naming (broken in
2202cd71c9 (L6R59))
so that the expression tests will actually run.
2013-05-08 16:35:16 +02:00
Julius Volz 56324d8ce2 Make AST query storage non-global. 2013-05-07 13:15:10 +02:00
Matt T. Proud ce45787dbf Storage interface to TieredStorage.
This commit drops the Storage interface and just replaces it with a
publicized TieredStorage type.  Storage had been anticipated to be
used as a wrapper for testability but just was not used due to
practicality.  Merely overengineered.  My bad.  Anyway, we will
eventually instantiate the TieredStorage dependencies in main.go and
pass them in for more intelligent lifecycle management.

These changes will pave the way for managing the curators without
Law of Demeter violations.
2013-05-03 15:54:14 +02:00
Julius Volz 9cea5d9df8 Convert the Prometheus configuration to protocol buffers. 2013-04-30 22:26:00 +02:00
Julius Volz d8110fcd9c Send sample arrays instead of single samples over channels. 2013-04-29 17:24:17 +02:00
Julius Volz dcf2e82752 Cleanup and idiomaticize rule/expression dot graph output. 2013-04-29 12:57:34 +02:00
Matt T. Proud b3e34c6658 Implement batch database sample curator.
This commit introduces to Prometheus a batch database sample curator,
which corroborates the high watermarks for sample series against the
curation watermark table to see whether a curator of a given type
needs to be run.

The curator is an abstract executor, which runs various curation
strategies across the database.  It remarks the progress for each
type of curation processor that runs for a given sample series.

A curation procesor is responsible for effectuating the underlying
batch changes that are request.  In this commit, we introduce the
CompactionProcessor, which takes several bits of runtime metadata and
combine sparse sample entries in the database together to form larger
groups.  For instance, for a given series it would be possible to
have the curator effectuate the following grouping:

- Samples Older than Two Weeks: Grouped into Bunches of 10000
- Samples Older than One Week: Grouped into Bunches of 1000
- Samples Older than One Day: Grouped into Bunches of 100
- Samples Older than One Hour: Grouped into Bunches of 10

The benefits hereof of such a compaction are 1. a smaller search
space in the database keyspace, 2. better employment of compression
for repetious values, and 3. reduced seek times.
2013-04-27 17:38:18 +02:00
Julius Volz 2202cd71c9 Track alerts over time and write out alert timeseries. 2013-04-26 14:35:21 +02:00
Julius Volz c0601abf46 Implement initial no-op alert parsing and rule parsing tests. 2013-04-23 13:48:24 +02:00
Matt T. Proud f9e99bd08a Refresh SampleValue to 64-bit floating point.
We always knew that this needed to be fixed.
2013-04-21 20:31:50 +02:00
Julius Volz 99dcbe0f94 Integrate memory and disk layers in view rendering. 2013-04-19 16:01:27 +02:00
Julius Volz 63625bd244 Make view use memory persistence, remove obsolete code.
This makes the memory persistence the backing store for views and
adjusts the MetricPersistence interface accordingly. It also removes
unused Get* method implementations from the LevelDB persistence so they
don't need to be adapted to the new interface. In the future, we should
rethink these interfaces.

All staleness and interpolation handling is now removed from the storage
layer and will be handled only by the query layer in the future.
2013-04-18 22:26:29 +02:00
Julius Volz 1eb586db7d Fix rule evaluation closure. 2013-04-17 15:11:21 +02:00
Julius Volz 5f5ea03105 Run "make format". 2013-04-16 17:23:59 +02:00
Julius Volz 1cff4f3d91 Fix rate() per-second adjustment.
This got broken during the depointerization of the Vector type.
2013-04-15 14:41:34 +02:00
juliusv 62f33f1fc2 Merge pull request #138 from prometheus/julius-fix-aliasing
Correct delta()/rate() intervals and temporal aliasing.
2013-04-15 05:38:48 -07:00
Matt T. Proud 167504efd6 Merge pull request #142 from prometheus/julius-lowercase-by
Allow lower-case BY operator.
2013-04-15 05:13:35 -07:00
Julius Volz d53b8cf956 Correct delta()/rate() intervals and temporal aliasing. 2013-04-15 12:30:46 +02:00
Julius Volz 000f6a2e23 Allow lower-case BY operator. 2013-04-15 11:56:23 +02:00
Julius Volz a0d311c9e6 Constantize job name label. 2013-04-15 11:47:54 +02:00
Julius Volz 1bc83e1b65 Also allow lower-cased aggregation ops. 2013-04-11 18:25:22 +02:00
juliusv f9c291120f Merge pull request #123 from prometheus/julius-propagate-rule-errors
Propagate more errors during rule evaluation.
2013-04-11 06:38:33 -07:00
Julius Volz 9a81b9838f Make expression parser goroutine-safe.
See https://github.com/prometheus/prometheus/issues/127
2013-04-10 19:17:28 +02:00
Julius Volz 6cb3c51d24 Add sort() and sort_desc() expression language functions. 2013-04-10 18:05:45 +02:00
Julius Volz c4d0969c00 Propagate more errors during rule evaluation. 2013-04-09 13:47:20 +02:00
Julius Volz e31591e6fe Allow single-letter identifiers (metric and label names). 2013-03-28 18:37:54 +01:00
Julius Volz ec413459fa Depointerize Matrix/Vector types as well as time.Time arguments. 2013-03-28 18:07:12 +01:00
Julius Volz 676845afaf Implement sample interpolation in query layer. 2013-03-28 16:41:51 +01:00
Matt T. Proud c53a72a894 Test data for the curator. 2013-03-27 18:13:43 +01:00
Julius Volz b836066c71 Eliminate need to get fingerprints during query execution time. 2013-03-27 14:42:03 +01:00
Julius Volz 55ca65aa6e More userfriendly output when we fail to create the tiered storage. 2013-03-27 11:25:05 +01:00
Matt T. Proud c4e971d7d9 Merge pull request #101 from prometheus/refactor/test/directory-extraction
Create temporary directory handler.
2013-03-26 10:46:28 -07:00
Matt T. Proud b86b0ea41a Create temporary directory handler. 2013-03-26 18:09:25 +01:00
Julius Volz 2b8f0b2cc7 Constantize metric name label name. 2013-03-26 16:20:23 +01:00
Julius Volz 3880a86c9c In case of empty query results, return an empty matrix. 2013-03-25 12:14:48 +01:00
Julius Volz 8e4c5b0cea Use AST query analyzer and views with tiered storage. 2013-03-21 18:16:52 +01:00
Julius Volz 2f814d0e6d AST persistence adapter simplifications after storage changes. 2013-03-21 18:11:03 +01:00
Julius Volz 6001d22f87 Change Get* methods to receive fingerprints instead of metrics. 2013-03-21 18:11:03 +01:00
Matt T. Proud 5959cd9e53 Include Julius' feedback. 2013-03-21 18:08:48 +01:00
Matt T. Proud a70ee43ad3 Niladic `ToString() to idiomatic String()`. 2013-03-21 18:08:47 +01:00
Matt T. Proud 41068c2e84 Checkpoint. 2013-03-21 18:06:51 +01:00
Matt T. Proud 13ae29b304 Initial in-memory arena implementation.
It is unbounded, and nothing uses it except for a gating flag in main.
2013-02-18 09:38:14 -06:00
Julius Volz c3d31febd6 Move durationToString to common place and cleanup error handling. 2013-02-14 19:02:23 +01:00
Matt T. Proud efbe0e8a12 Interface simplification.
GetMetricForFingerprint(model.Fingerprint) (*Metric, error) ->
GetMetricForFingerprint(model.Fingerprint) (Metric, error)
2013-02-14 08:43:02 -08:00
Matt T. Proud e8a733b525 Interface simplifications.
GetFingerprintsForLabelSet ([]*Fingerprint, error) ->
GetFingerprintsForLabelSet ([]Fingerprint, error)
2013-02-14 08:07:59 -08:00
Matt T. Proud f03091b139 Interface simplifications: GetRangeValues
From pointers to copies.
2013-02-13 21:11:23 -08:00
Matt T. Proud 56f069b3ec Interface simplifications: GetValueAtTime().
Pointer arguments to copies.
2013-02-13 21:05:01 -08:00
Matt T. Proud 900bb988c1 Simplifications of GetFingerprintsForLabelSet.
``MetricPersistence.GetFingerprintsForLabelSet(s *model.LabelSet)`` ->
``MetricPersistence.GetFingerprintsForLabelSet(s model.LabelSet)``.
2013-02-13 17:13:41 -08:00
Matt T. Proud 4fbcea73f5 MetricPersistence.AppendSample signature changes.
``MetricPersistence.AppendSample(*model.Sample)`` -> ``MetricPersistence.AppendSample(model.Sample)``.
2013-02-13 13:46:28 -08:00
Julius Volz 06ace4941d Remove/replace last references to github.com/matttproud/... 2013-02-07 14:32:18 +01:00
Julius Volz 16d9dcd6a8 Add copyright notices to all remaining files. 2013-02-07 11:49:04 +01:00
Julius Volz d67e4b9131 Address outstanding comments from PR/47 and other cleanups. 2013-02-07 11:38:01 +01:00
Matt T. Proud ea54751431 Update import paths to new location.
This repository moved from matttproud/prometheus to
prometheus/prometheus, and all import paths need to be updated.
2013-01-27 18:49:45 +01:00
Julius Volz c049ae39af Cleanups to rules/persistence adapter code. 2013-01-25 12:22:55 +01:00
juliusv 619aa97025 Only close rule file if it could be opened. 2013-01-25 03:32:46 +01:00
Julius Volz a85204a0a4 Add support for matrix duration strings without quotes. 2013-01-22 02:27:26 +01:00
Julius Volz 1760d927c8 Add error propagation to web UI via special JSON error type. 2013-01-22 02:27:26 +01:00
Julius Volz 49c87348b5 Implement per-second rate behavior for rate(). 2013-01-22 02:27:26 +01:00
Julius Volz 93670aa129 Return API errors in JSON format. 2013-01-22 02:27:26 +01:00
Julius Volz a20bf35997 Fix whitespace with "make format". 2013-01-22 02:27:26 +01:00
Julius Volz c21450a089 Use correct label name for metric name in rule. 2013-01-22 02:27:26 +01:00
Julius Volz 6929c10acf Add case-statement for OR, which still needs to be implemented. 2013-01-22 02:27:26 +01:00
Julius Volz a555ded2b3 Add "w" (weeks) as a valid timeunit. 2013-01-22 02:27:26 +01:00
Julius Volz 2c8595f96e First graphing support. 2013-01-22 02:27:26 +01:00
Matt T. Proud efe61c18fa Refactor target scheduling to separate facility.
``Target`` will be refactored down the road to support various
nuanced endpoint types.  Thusly incorporating the scheduling
behavior within it will be problematic.  To that end, the scheduling
behavior has been moved into a separate assistance type to improve
conciseness and testability.

``make format`` was also run.
2013-01-13 10:43:37 +01:00
Julius Volz cb6eb30182 Fix state cleanup bug between rule/config parser runs.
This fixes a bug that has been annoying me minorly for some time now:
sometimes, after parse errors, a subsequent parser run would fail.  The reason
is that yylex() modifies some global variables (yytext, yydata) during its run
to keep state. To make subsequent parser runs correct, these have to be reset
before each run.

Also, close files after reading them.
2013-01-12 02:35:40 +01:00
Julius Volz 17a4a442b3 Add REST API, expression browser, and text/JSON output formats. 2013-01-11 02:27:03 +01:00
Julius Volz 06162180ad Implement matrix range and boundary fetching from metrics store. 2013-01-11 01:19:27 +01:00
Julius Volz 483bd81a44 Allow grammar to parse both rules and single expressions. 2013-01-11 01:17:37 +01:00
Julius Volz c52b959fda Fix matrix interval time calculation. 2013-01-11 01:12:34 +01:00
Julius Volz fdf9a3aab7 Fix node type checks in arithmetic expressions. 2013-01-11 01:09:31 +01:00
Julius Volz c4a2358551 Set correct interval in MatrixLiteral.Eval(). 2013-01-11 01:08:47 +01:00
Julius Volz 429b66019c Exclude metric name in vector arithmetric label matching. 2013-01-11 01:07:48 +01:00
Julius Volz 56384bf42a Add initial config and rule language implementation. 2013-01-07 23:43:36 +01:00