Commit graph

13975 commits

Author SHA1 Message Date
Julius Volz c582ae73c2 Implement topk() and bottomk() functions.
To achieve O(n * log k) runtime, this uses a heap to track the current
bottom-k or top-k elements while iterating over the full set of
available elements.

It would be possible to reuse more code between topk and bottomk, but I
decided for some more duplication for the sake of clarity.

This fixes https://github.com/prometheus/prometheus/issues/399

Change-Id: I7487ddaadbe7acb22ca2cf2283ba6e7915f2b336
2014-11-25 17:02:00 +01:00
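
The heap approach described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the names `minHeap` and `topK` are hypothetical, and it works on plain float64 values rather than Prometheus samples. It keeps a min-heap of at most k elements while scanning the input once, so each element costs at most O(log k) heap work.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// minHeap is a min-heap of float64 values. Its root is the smallest of the
// (at most k) largest values seen so far. Hypothetical helper type.
type minHeap []float64

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(float64)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topK returns the k largest values in descending order. If k exceeds the
// number of inputs, all inputs are returned.
func topK(values []float64, k int) []float64 {
	if k <= 0 {
		return nil
	}
	h := &minHeap{}
	for _, v := range values {
		if h.Len() < k {
			// Heap not full yet: always keep the element.
			heap.Push(h, v)
		} else if v > (*h)[0] {
			// v beats the smallest kept element: replace the root
			// and restore the heap property.
			(*h)[0] = v
			heap.Fix(h, 0)
		}
	}
	out := make([]float64, h.Len())
	copy(out, *h)
	sort.Sort(sort.Reverse(sort.Float64Slice(out)))
	return out
}

func main() {
	fmt.Println(topK([]float64{3, 1, 4, 1, 5, 9, 2, 6}, 3)) // [9 6 5]
	fmt.Println(topK([]float64{3, 1}, 5))                   // [3 1]
}
```

A bottom-k variant would flip the comparison in `Less` and in the replacement check; the commit message notes that the author chose to duplicate that logic rather than share it.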
Bjoern Rabenstein 1909686789 Make metrics exported by the Prometheus server itself more consistent.
- Always spell out the time unit (e.g. milliseconds instead of ms).

- Remove "_total" from the names of metrics that are not counters.

- Make use of the "Namespace" and "Subsystem" fields in the options.

- Removed the "capacity" facet from all metrics about channels/queues.
  These are all fixed via command line flags and will never change
  during the runtime of a process. Also, they should not be part of
  the same metric family. I have added separate metrics for the
  capacity of queues as convenience. (They will never change and are
  only set once.)

- I left "metric_disk_latency_microseconds" unchanged, although that
  metric measures the latency of the storage device, even if it is not
  a spinning disk. "SSD" is read by many as "solid state disk", so
  it's not too far off. (It should be "solid state drive", of course,
  but "metric_drive_latency_microseconds" is probably confusing.)

- Brian suggested to not mix "failure" and "success" outcome in the
  same metric family (distinguished by labels). For now, I left it as
  it is. We are touching some bigger issue here, especially as other
  parts in the Prometheus ecosystem are following the same
  principle. We still need to come to terms here and then change
  things consistently everywhere.

Change-Id: If799458b450d18f78500f05990301c12525197d3
2014-11-25 17:02:00 +01:00
Brian Brazil 4a2b96f848 Remove backoff on scrape failure.
Having metrics with variable timestamps inconsistently
spaced when things fail will make it harder to write correct rules.

Update status page, requires some refactoring to insert a function.

Change-Id: Ie1c586cca53b8f3b318af8c21c418873063738a8
2014-11-25 17:02:00 +01:00
Julius Volz 00b9489f1c Fix time() behavior.
time() should return the timestamp for which the query is executed, not
the actual current time.

Change-Id: I430a45cabad7785cd58f95b1028a71dff4c87710
2014-11-25 17:02:00 +01:00
Julius Volz c5984f1818 Add abs() and over-time aggregation functions.
This implements aggregation functions over time as requested in
https://github.com/prometheus/prometheus/issues/383.

Change-Id: Ifd69b850de8cfdf6e7a6c0e042056fa4c672410e
2014-11-25 17:02:00 +01:00
Julius Volz 1bb7074fec Fix HTTP connection leak upon non-OK status.
Change-Id: Ie7fbd7dcc089b8306b40631be3e3d736c23c1cd3
2014-11-25 17:02:00 +01:00
Brian Brazil 144d5bb9fd Add 'tmpl', a 'template' for non-string literal names.
Change-Id: I6a03a5c5d20029cf414562efa7745ed6c53b2731
2014-11-25 17:02:00 +01:00
Brian Brazil f525ca5d9e Let consoles get graph links from expressions.
Rename ConsoleLinkFromExpression, as we now have consoles.

Change-Id: I7ed2c9c83863adb390b51121dd9736845f7bcdfc
2014-11-25 17:01:59 +01:00
Brian Brazil eba205fcac Expose path used to get to console to console.
Change-Id: I72386a2d4e53863da302ecc5c7e44d6c310197e0
2014-11-25 17:01:59 +01:00
Johannes 'fish' Ziemke aed1d384a9 Build prometheus tools as well
Change-Id: I49d5ca4d6ff715e8a6631caf052de309b91b0b1b
2014-11-25 17:01:59 +01:00
Brian Brazil eb5d928da7 Fix console handler.
This was accidentally broken in 2128d9d811.

Change-Id: I50ea1fdb8ae4d28ae4555410bee97e5037692aa5
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein bacc31d5cc Remove work-around that required copying all bytes of a scrape.
Now that the subtle bug in matttproud/golang_protobuf_extensions is
fixed, we do not need to copy the bytes of a scrape into a buffer
first before starting to parse it.

Change-Id: Ib73ecae16173ddd219cda56388a8f853332f8853
2014-11-25 17:01:59 +01:00
Julius Volz 74de633a3a Prometheus version 0.6.0.
Change-Id: I50f6b69cca952eedf9a62b9a8f58e0fb633a83ed
2014-11-25 17:01:59 +01:00
Julius Volz 80b3d3bf34 Speed up disk flushes by removing unnecessary sort.
The first sort in groupByFingerprint already ensures that all resulting sample
lists contain only one fingerprint. We also already assume that all
samples passed into AppendSamples (and thus groupByFingerprint) are
chronologically sorted within each fingerprint.

The extra chronological sort is thus superfluous. Furthermore, this
second sort didn't only sort chronologically, but also compared all
metric fingerprints again (although we already know that we're only
sorting within samples for the same fingerprint). This caused a huge
memory and runtime overhead.

In a heavily loaded real Prometheus, this brought down disk flush times
from ~9 minutes to ~1 minute.

OLD:
BenchmarkLevelDBAppendRepeatingValues   5  331391808 ns/op  44542953 B/op   597788 allocs/op
BenchmarkLevelDBAppendsRepeatingValues  5  329893512 ns/op  46968288 B/op  3104373 allocs/op

NEW:
BenchmarkLevelDBAppendRepeatingValues   5  299298635 ns/op  43329497 B/op   567616 allocs/op
BenchmarkLevelDBAppendsRepeatingValues 20   92204601 ns/op   1779454 B/op    70975 allocs/op

Change-Id: Ie2d8db3569b0102a18010f9e106e391fda7f7883
2014-11-25 17:01:59 +01:00
Julius Volz 21cafe6cd7 Only evict memory series after they are on disk.
This fixes the problem where samples become temporarily unavailable for
queries while they are being flushed to disk. Although the entire
flushing code could use some major refactoring, I'm explicitly trying to
do the minimal change to fix the problem since there's a whole new
storage implementation in the pipeline.

Change-Id: I0f5393a30b88654c73567456aeaea62f8b3756d9
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein 8956faeccb Migrate to new client_golang.
This change will only be submitted when the new client_golang has been
moved to the new version.

Change-Id: Ifceb59333072a08286a8ac910709a8ba2e3a1581
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein 814e479723 Treat non-200 HTTP response as error.
Change-Id: I2a9f3b47012b3c4839be53aa44c66d16dd41a24a
2014-11-25 17:01:59 +01:00
Brian Brazil e27447da5c Remove the broken "User Dashboard" link.
Due to the lack of a </a>, this makes the entire header render badly.
Accordingly it's safe to assume no one is using it, so remove it.
With the new console template support, we'll need to do something a bit
more nuanced later.

Change-Id: I3424bed6aea18cbd4c63ad48f98808098dadc3ad
2014-11-25 17:01:59 +01:00
Brian Brazil 2f76f434a5 Add humanizeDuration function.
This attempts to reasonably handle things from weekly cronjobs,
to rpcs taking ns to things that are usually ms but jump to over a second.

For consistency, stop putting spaces before prefixes.

Change-Id: I6407879187b25680b323cd70254e205315b5fc3c
2014-11-25 17:01:59 +01:00
Brian Brazil 960ede66dc Use html/template for console templates and add template library support.
Add a function to bypass the new auto-escaping.
Add a function to work around Go's templates only allowing one argument to be passed in.

Change-Id: Id7aa3f95e7c227692dc22108388b1d9b1e2eec99
2014-11-25 17:01:59 +01:00
Brian Brazil 0f5874ff97 Make Prometheus in header link to status page.
This is consistent with alertmanager, and more intuitive for users.

The graphs page just has graphs, so remove mention of consoles.

Change-Id: I87780a4ade33697a6095423e1a7de47d341d2838
2014-11-25 17:01:59 +01:00
Brian Brazil cd3592aebc Add title and match functions.
Change-Id: Ifd376c2935e22d378e7afa06122642847a237d78
2014-11-25 17:01:59 +01:00
Brian Brazil 1828b1f55c Only log every query when debugging.
Change-Id: I4f988d81cda6f6deb0ed7f497de4aa75409b158f
2014-11-25 17:01:59 +01:00
Brian Brazil 9b74324d9e Add functions for regex replacement, sorting and humanizing.
Change-Id: I471c7a8087cd5432b51afce811b591b11583a0c3
2014-11-25 17:01:59 +01:00
Julius Volz 00fd10e24f Update GeneratorURL field name in notification tests.
Change-Id: Ic4357999b6ebcf54008869a395e56d12a0ead211
2014-11-20 18:10:43 +01:00
Julius Volz 459f551259 Merge "Eliminate modal alerts in graphing UI." 2014-10-30 17:00:57 +01:00
Julius Volz 0da8b2add1 Make tabular view the default (vs. graphing view).
Change-Id: I9f0961f2c474e8cce5e376ce4e20040644f89370
2014-10-30 16:38:25 +01:00
Julius Volz 921ebbf744 Eliminate modal alerts in graphing UI.
This shows errors in a pane under the expression input instead.

Change-Id: Iec209e1628a3b102cce9f34b2467621772dfb8ff
2014-10-30 16:18:05 +01:00
Julius Volz 2c4cab07b1 Fix acronym caps in GeneratorURL.
Change-Id: Ib18c1f617dcde1039e848059545a6d8831d9bf66
2014-10-27 17:03:00 +01:00
Julius Volz f1aac54104 Allow alternative "by"-clause position in grammar.
In addition to the existing by-clause syntax:

  sum(<expression>) by (<labels>) [keeping_extra]

...this allows the following new syntax:

  sum by (<labels>) [keeping_extra] (<expression>)

Both orderings may be used in a single expression. It is up to the users
to establish guidelines around their usage.

Change-Id: Iba10c9cc5fb6ac62edfcf246d281473e82467992
2014-10-22 11:57:20 +02:00
Brian Brazil 2aa8c8669e Make query_range more robust.
Gracefully handle decimal values, by truncating them.

Limit the number of steps, to avoid accidentally pulling too much data.
This limit returns up to ~500kB per timeseries, and allows
for 60s granularity for a week and 1h granularity for a year.

Change-Id: Ie549fc24deb2eecbc6c5d1b6088a548a6b02e849
2014-10-20 18:39:46 +01:00
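
The two granularity examples in this commit message can be checked with a little arithmetic. The sketch below is illustrative only: `stepCount`, `checkSteps`, and the cap `maxSteps = 11000` are assumptions chosen so that both examples fit under the cap, and they are not taken from the commit itself.

```go
package main

import "fmt"

// maxSteps is an assumed cap on the number of evaluation steps per query.
// The commit message does not state the exact value.
const maxSteps = 11000

// stepCount returns how many steps of stepSeconds fit into rangeSeconds.
// Integer division truncates any fractional step.
func stepCount(rangeSeconds, stepSeconds int64) int64 {
	return rangeSeconds / stepSeconds
}

// checkSteps rejects queries that would need more than maxSteps steps.
func checkSteps(rangeSeconds, stepSeconds int64) error {
	if stepSeconds <= 0 {
		return fmt.Errorf("step must be positive, got %d", stepSeconds)
	}
	if n := stepCount(rangeSeconds, stepSeconds); n > maxSteps {
		return fmt.Errorf("query needs %d steps, limit is %d", n, maxSteps)
	}
	return nil
}

func main() {
	const week = 7 * 24 * 3600   // seconds in a week
	const year = 365 * 24 * 3600 // seconds in a (non-leap) year

	fmt.Println(stepCount(week, 60))         // 10080
	fmt.Println(stepCount(year, 3600))       // 8760
	fmt.Println(checkSteps(year, 60) != nil) // true: 525600 steps is over the cap
}
```

A week at 60s granularity needs 10080 steps and a year at 1h granularity needs 8760, so any cap slightly above 10080 admits both cases while rejecting, for example, a year at 60s granularity.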
Brian Brazil 50a995c8de Don't alert() when a query is aborted,
such as when you change the range.

Change-Id: I574504f97446ac5f3dda737fe054ae83f17dbbc2
2014-10-15 15:38:09 +01:00
Julius Volz 080b952647 Allow omitting the metric name in queries.
This allows the following expression syntaxes for selecting timeseries:

  foo                    (already valid before)
  foo{}                  (already valid before)
  {job="prometheus"}     (new, select all timeseries for job "prometheus")

Omitting both the metric name *and* any label matchers ("" or "{}") will
still yield a syntax error.

To get all timeseries, you could do:

    {__name__=~".*"}
       or, without relying on knowledge about __name__:
    {job=~".*"}

Change-Id: Ifee000b9ac0184ef6ced18411069c7f2699a2dda
2014-10-14 17:43:37 +02:00
Brian Brazil 35fb5378bc Add back consoles link.
Goes in index.html in consoles or else user data, if present.

Change-Id: I5303d30aa24ca0c20d2e0f49121e04a260b9c4f4
2014-10-02 15:44:47 +01:00
Andres Suarez dba246e97a Focus expression after selection from dropdown
Change-Id: Id7f67e558e3611ab4c7188cc428c342d8d3e67db
2014-09-16 19:02:01 +02:00
Andres Suarez 76527bae8b Allow selecting metric from Insert Metric
Change-Id: I99e0539cab2749a8aeabc0a13015889ff45834f7
2014-09-16 19:01:14 +02:00
Bjoern Rabenstein de337e6404 Cut v0.8.0.
Change-Id: Ie8d49793e78f10bdeb7ebe19cc2dc729ff7ef590
2014-09-04 15:41:13 +02:00
Bjoern Rabenstein 943a939c29 Fix the accept header.
A '/' is a separator and has to be in a quoted string.

Change-Id: If7a3a847f84f8f709074d05dc98b5b21e954030c
2014-09-03 16:46:29 +02:00
Julius Volz f739980dfe Format changelog properly.
Change-Id: I62c5bf8c5b880272d207da564a3fc45490c5db5e
2014-08-25 15:14:25 +02:00
Julius Volz e995cda75c Merge "Stagger scrapes to spread out load." 2014-08-20 18:13:19 +02:00
Brian Brazil 3b3ec604c3 Stagger scrapes to spread out load.
Change-Id: Ib141b271e4adfb817886871f86051c207b05cf35
2014-08-20 17:07:10 +01:00
Julius Volz 0ca5be127f Prometheus version 0.7.0.
Change-Id: I73468f72b43654f4bf57627c2f49fe802b18f637
2014-08-06 14:13:17 +02:00
Julius Volz bfb64321de Merge "Update used Go version to 1.3." 2014-08-06 13:10:09 +02:00
Julius Volz ef3b512dcf Update used Go version to 1.3.
Go downloads moved to a different URL and require following redirects
(curl's '-L' option) now.

Go 1.3 deliberately randomizes ranges over maps, which uncovered some
bugs in our tests. These are fixed too.

Change-Id: Id2d9e185d8d2379a9b7b8ad5ba680024565d15f4
2014-08-06 12:51:53 +02:00
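
Go does not specify the iteration order of a map, so tests that compare output built by ranging over a map can pass or fail by chance. One common way to make such code deterministic is to sort the keys first. The sketch below illustrates that pattern; `sortedKeys` and the label values are hypothetical and are not taken from the test fixes in this commit.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order, so callers can
// iterate over the map in a stable, reproducible order.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Example label set; the values are made up for illustration.
	labels := map[string]string{
		"job":      "prometheus",
		"instance": "localhost:9090",
		"group":    "canary",
	}
	// Always prints group, instance, job in that order.
	for _, k := range sortedKeys(labels) {
		fmt.Printf("%s=%q\n", k, labels[k])
	}
}
```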
Julius Volz b65c5dd752 Add function to drop common labels in a vector.
This fixes https://github.com/prometheus/prometheus/issues/384.

Change-Id: I2973c4baeb8a4618ec3875fb11c6fcf5d111784b
2014-08-05 20:43:52 +02:00
Julius Volz f7cd18abdf Add more topk() / bottomk() tests.
Test what happens if k > number of input elements.

Change-Id: Ie724b850939e297ebf085f0a5a3522e9cfcc6534
2014-08-05 20:14:04 +02:00
Julius Volz 200d02effe Implement topk() and bottomk() functions.
To achieve O(n * log k) runtime, this uses a heap to track the current
bottom-k or top-k elements while iterating over the full set of
available elements.

It would be possible to reuse more code between topk and bottomk, but I
decided for some more duplication for the sake of clarity.

This fixes https://github.com/prometheus/prometheus/issues/399

Change-Id: I7487ddaadbe7acb22ca2cf2283ba6e7915f2b336
2014-08-05 19:05:36 +02:00
Bjoern Rabenstein 24ece38f7c Make metrics exported by the Prometheus server itself more consistent.
- Always spell out the time unit (e.g. milliseconds instead of ms).

- Remove "_total" from the names of metrics that are not counters.

- Make use of the "Namespace" and "Subsystem" fields in the options.

- Removed the "capacity" facet from all metrics about channels/queues.
  These are all fixed via command line flags and will never change
  during the runtime of a process. Also, they should not be part of
  the same metric family. I have added separate metrics for the
  capacity of queues as convenience. (They will never change and are
  only set once.)

- I left "metric_disk_latency_microseconds" unchanged, although that
  metric measures the latency of the storage device, even if it is not
  a spinning disk. "SSD" is read by many as "solid state disk", so
  it's not too far off. (It should be "solid state drive", of course,
  but "metric_drive_latency_microseconds" is probably confusing.)

- Brian suggested to not mix "failure" and "success" outcome in the
  same metric family (distinguished by labels). For now, I left it as
  it is. We are touching some bigger issue here, especially as other
  parts in the Prometheus ecosystem are following the same
  principle. We still need to come to terms here and then change
  things consistently everywhere.

Change-Id: If799458b450d18f78500f05990301c12525197d3
2014-07-31 15:44:31 +02:00
Julius Volz ffa4a2935e Merge "Remove backoff on scrape failure." 2014-07-29 22:34:05 +02:00
Brian Brazil 3835b7507d Remove backoff on scrape failure.
Having metrics with variable timestamps inconsistently
spaced when things fail will make it harder to write correct rules.

Update status page, requires some refactoring to insert a function.

Change-Id: Ie1c586cca53b8f3b318af8c21c418873063738a8
2014-07-29 17:43:52 +01:00